Data Visualization Exercise
-1, How many rows are in penguins? How many columns?
## # A tibble: 344 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
The penguins data frame has 344 rows and 8 columns.
-2, What does the bill_depth_mm variable in the penguins data frame
describe? Read the help for ?penguins to find out.
#A: It is a number denoting bill depth (millimeters)
-3, Make a scatterplot of bill_depth_mm vs. bill_length_mm. That is,
make a scatterplot with bill_depth_mm on the y-axis and bill_length_mm
on the x-axis. Describe the relationship between these two
variables.
ggplot(
  data = penguins,
  mapping = aes(x = bill_length_mm, y = bill_depth_mm)
) +
  geom_point()
## Warning: Removed 2 rows containing missing values (`geom_point()`).

#A: The scatterplot does not suggest a strong overall relationship between bill_depth_mm and bill_length_mm; bill depth values are widely scattered across the range of bill lengths.
-4, What happens if you make a scatterplot of species
vs. bill_depth_mm? What might be a better choice of geom?
ggplot(
  data = penguins,
  mapping = aes(x = species, y = bill_depth_mm)
) +
  geom_point()
#A: The bill depth of each sample is displayed over the three species categories, but the points overlap heavily within each category. A better choice would be geom_boxplot(), which summarizes the distribution of bill_depth_mm for each species.
-5, Why does the following give an error and how would you fix
it?
ggplot(data = penguins) + geom_point()
#A: The code does not map any variables to the x and y aesthetics. I would add mapping = aes(...) with the desired variables to generate the graph.
-6, What does the na.rm argument do in geom_point()? What is the
default value of the argument? Create a scatterplot where you
successfully use this argument set to TRUE.
#"na.rm" silently removes missing values (coded as NA) before plotting instead of emitting a warning; its default value is FALSE.
ggplot(
data = penguins,
mapping = aes(x = bill_length_mm, y = bill_depth_mm)
) +
geom_point(na.rm=TRUE)

-7, Add the following caption to the plot you made in the previous
exercise: “Data come from the palmerpenguins package.” Hint: Take a look
at the documentation for labs().
#Use labs() to add the caption.
ggplot(
  data = penguins,
  mapping = aes(x = bill_length_mm, y = bill_depth_mm)
) +
  geom_point(na.rm = TRUE) +
  labs(
    x = "Bill length (mm)", y = "Bill depth (mm)",
    caption = "Data come from the palmerpenguins package."
  )

-8, Recreate the following visualization. What aesthetic should
bill_depth_mm be mapped to? And should it be mapped at the global level
or at the geom level?
ggplot(data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point(aes(color = bill_depth_mm)) +
geom_smooth()

#bill_depth_mm should be mapped to color inside geom_point(aes(color = bill_depth_mm)), i.e. at the geom level. If it were mapped at the global level, geom_smooth() would inherit it as well, but the smooth line should be fit to all of the data rather than varying with bill depth.
-9. Run this code in your head and predict what the output will look
like. Then, run the code in R and check your predictions.
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = island)
) +
  geom_point() +
  geom_smooth(se = FALSE)
#A: It will generate a plot of body_mass_g vs. flipper_length_mm with points colored by island, plus a separate smooth line (without confidence bands) for each island, because color is mapped at the global level and is therefore inherited by geom_smooth().
-10. Will these two graphs look different? Why/why not?
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point() +
  geom_smooth()

ggplot() +
  geom_point(
    data = penguins,
    mapping = aes(x = flipper_length_mm, y = body_mass_g)
  ) +
  geom_smooth(
    data = penguins,
    mapping = aes(x = flipper_length_mm, y = body_mass_g)
  )
#A: No, the two graphs will look the same, because both geoms receive identical data and aesthetic mappings. The only difference is where the mappings are declared: globally in ggplot() in the first plot, and locally in each geom in the second.
2.4.3 Exercise
-1, Make a bar plot of species of penguins, where you assign species
to the y aesthetic. How is this plot different?
#The plot is 'flipped': the species appear on the y-axis and the counts on the x-axis, producing horizontal bars.
-2, How are the following two plots different? Which aesthetic, color
or fill, is more useful for changing the color of bars?
ggplot(penguins, aes(x = species)) + geom_bar(color = "red")
ggplot(penguins, aes(x = species)) + geom_bar(fill = "red")
#In the first plot only the outlines of the bars are red; in the second the bars are filled with red. fill is more useful for changing the color of bars, since it colors their interior.
-3, What does the bins argument in geom_histogram() do?
#bins sets the number of "buckets" the data are cut into; the default is 30.
-4, Make a histogram of the carat variable in the diamonds dataset
that is available when you load the tidyverse package. Experiment with
different binwidths. What binwidth reveals the most interesting
patterns?
ggplot(diamonds, aes(x = carat)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(diamonds, aes(x = carat)) +
geom_histogram(bins = 15)

ggplot(diamonds, aes(x = carat)) +
geom_histogram(binwidth = 1.5)

#The binwidth of 1.5 shows the most interesting pattern, cutting the data into only 3 bars.
2.5.5 Exercise
-1, The mpg data frame that is bundled with the ggplot2 package
contains 234 observations collected by the US Environmental Protection
Agency on 38 car models. Which variables in mpg are categorical? Which
variables are numerical? (Hint: Type ?mpg to read the documentation for
the dataset.) How can you see this information when you run mpg?
str(mpg)
## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
## $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
## $ model : chr [1:234] "a4" "a4" "a4" "a4" ...
## $ displ : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr [1:234] "f" "f" "f" "f" ...
## $ cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr [1:234] "p" "p" "p" "p" ...
## $ class : chr [1:234] "compact" "compact" "compact" "compact" ...
#manufacturer, model, trans, drv, fl, and class are categorical variables
#displ, year, cyl, cty, and hwy are numerical variables
#I can use str(mpg) (or simply print mpg), which displays each variable's type: <chr> columns are categorical, while <num> and <int> columns are numerical.
-2, Make a scatterplot of hwy vs. displ using the mpg data frame.
Next, map a third, numerical variable to color, then size, then both
color and size, then shape. How do these aesthetics behave differently
for categorical vs. numerical variables?
ggplot(mpg, aes(x = displ, y = hwy, color = cyl)) +
geom_point()

ggplot(mpg, aes(x = displ, y = hwy, size = cyl)) +
geom_point()

ggplot(mpg, aes(x = displ, y = hwy, color = cyl, size = cyl)) +
geom_point()

ggplot(mpg, aes(x = displ, y = hwy, shape = drv)) +
geom_point()

#When a numerical variable is mapped to color or size, ggplot2 uses a continuous scale: higher values get lighter colors or larger points.
#When a categorical variable is mapped to color, each category gets its own discrete color, which helps distinguish the groups.
#When a categorical variable is mapped to shape, each category gets a distinct shape; shape only accepts categorical variables, so a continuous variable like cyl cannot be mapped to it.
-3, In the scatterplot of hwy vs. displ, what happens if you map a
third variable to linewidth?
#It has no effect on the points, since the linewidth aesthetic controls the width of lines (as in geom_line() or geom_smooth()), not the appearance of points.
-4, What happens if you map the same variable to multiple
aesthetics?
#ggplot2 encodes the variable redundantly in every aesthetic it is mapped to; for example, points can vary in color and size together, and the legends for the two aesthetics are merged.
-5, Make a scatterplot of
bill_depth_mm vs. bill_length_mm and color the points by species. What
does adding coloring by species reveal about the relationship between
these two variables? What about faceting by species?
ggplot(data = penguins,
mapping = aes(x = bill_length_mm , y = bill_depth_mm))+
geom_point(aes(color = species))

#Each species forms a distinct cluster in bill_depth_mm vs. bill_length_mm, and within each cluster the two variables are positively related. Faceting by species separates the clusters into panels, which makes each within-species relationship even easier to see.
-6,Why does the following yield two separate legends? How would you
fix it to combine the two legends?
ggplot(
  data = penguins,
  mapping = aes(
    x = bill_length_mm, y = bill_depth_mm,
    color = species, shape = species
  )
) +
  geom_point() +
  labs(color = "Species")
#This is because labs(color = "Species") renames only the color legend; the shape legend keeps the default title "species", and ggplot2 only merges legends whose titles and keys match.
#FIX by giving both legends the same title
ggplot(data = penguins,
mapping = aes(
x = bill_length_mm, y = bill_depth_mm,
color = species, shape = species )) +
geom_point() +
scale_color_discrete(name = "Species") +
scale_shape_discrete(name = "Species")

-7, Create the two following stacked bar plots. Which question can
you answer with the first one? Which question can you answer with the
second one?
ggplot(penguins, aes(x = island, fill = species)) +
geom_bar(position = "fill")

ggplot(penguins, aes(x = species, fill = island)) +
geom_bar(position = "fill")

#First one: what proportion of each island's penguins belongs to each species (species composition within islands)?
#Second one: what proportion of each species lives on each island (island composition within species)?
2.6.1 Exercise
-1,Run the following lines of code. Which of the two plots is saved
as mpg-plot.png? Why?
ggplot(mpg, aes(x = class)) + geom_bar()
ggplot(mpg, aes(x = cty, y = hwy)) + geom_point()
ggsave("mpg-plot.png")
#The second one is saved, because ggsave() defaults to the most recently displayed plot.
-2, What do you need to change in the code above to save the plot as a
PDF instead of a PNG? How could you find out what types of image files
would work in ggsave()?
#Change the file extension from .png to .pdf at the end of the file name. The supported file types are listed in the documentation for ggsave() (see the device argument in ?ggsave).
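A minimal sketch of the change (the plot and filename here are examples): ggsave() infers the graphics device from the file extension, so only the extension needs to change.

```r
# Sketch: ggsave() picks the output device from the file extension,
# so saving a PDF only requires changing ".png" to ".pdf".
library(ggplot2)

p <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()

ggsave("mpg-plot.pdf", plot = p)
```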
3.5 Exercise
-1, Why does this code not work?
my_variable <- 10
my_varıable
## Error in eval(expr, envir, enclos): object 'my_varıable' not found
Look carefully! (This may seem like an exercise in pointlessness, but
training your brain to notice even the tiniest difference will pay off
when programming.)
#There is a typo: the second name uses a dotless 'ı' instead of 'i', so my_varıable refers to a different, undefined object.
-2,Tweak each of the following R commands so that they run
correctly:
libary(todyverse)
ggplot(dTA = mpg) + geom_point(maping = aes(x = displ y = hwy)) +
geom_smooth(method = "lm)
library(tidyverse)
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  geom_smooth(method = "lm")

-3,Press Option + Shift + K / Alt + Shift + K. What happens? How can
you get to the same place using the menus?
#The Keyboard Shortcut Quick Reference pops up. The same place can be reached through the menus: Tools -> Keyboard Shortcuts Help.
-4, Let’s revisit an exercise from the Section 2.6. Run the following
lines of code. Which of the two plots is saved as mpg-plot.png? Why?
my_bar_plot <- ggplot(mpg, aes(x = class)) + geom_bar()
my_scatter_plot <- ggplot(mpg, aes(x = cty, y = hwy)) + geom_point()
ggsave(filename = "mpg-plot.png", plot = my_bar_plot)
#The first one, because ggsave() is told explicitly which plot to save via plot = my_bar_plot.
4.2.5 Rows Exercise
-1, In a single pipeline for each condition, find all flights that
meet the condition:
library(nycflights13)
#Had an arrival delay of two or more hours
filter(flights, arr_delay >= 120)
## # A tibble: 10,200 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 811 630 101 1047 830
## 2 2013 1 1 848 1835 853 1001 1950
## 3 2013 1 1 957 733 144 1056 853
## 4 2013 1 1 1114 900 134 1447 1222
## 5 2013 1 1 1505 1310 115 1638 1431
## 6 2013 1 1 1525 1340 105 1831 1626
## 7 2013 1 1 1549 1445 64 1912 1656
## 8 2013 1 1 1558 1359 119 1718 1515
## 9 2013 1 1 1732 1630 62 2028 1825
## 10 2013 1 1 1803 1620 103 2008 1750
## # ℹ 10,190 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
#Flew to Houston (IAH or HOU)
filter(flights, dest == "IAH" | dest == "HOU")
## # A tibble: 9,313 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 623 627 -4 933 932
## 4 2013 1 1 728 732 -4 1041 1038
## 5 2013 1 1 739 739 0 1104 1038
## 6 2013 1 1 908 908 0 1228 1219
## 7 2013 1 1 1028 1026 2 1350 1339
## 8 2013 1 1 1044 1045 -1 1352 1351
## 9 2013 1 1 1114 900 134 1447 1222
## 10 2013 1 1 1205 1200 5 1503 1505
## # ℹ 9,303 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
#Were operated by United, American, or Delta
filter(flights, carrier %in% c("AA", "DL", "UA"))
## # A tibble: 139,504 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 554 600 -6 812 837
## 5 2013 1 1 554 558 -4 740 728
## 6 2013 1 1 558 600 -2 753 745
## 7 2013 1 1 558 600 -2 924 917
## 8 2013 1 1 558 600 -2 923 937
## 9 2013 1 1 559 600 -1 941 910
## 10 2013 1 1 559 600 -1 854 902
## # ℹ 139,494 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
#Departed in summer (July, August, and September)
filter(flights, month >= 7, month <= 9)
## # A tibble: 86,326 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 7 1 1 2029 212 236 2359
## 2 2013 7 1 2 2359 3 344 344
## 3 2013 7 1 29 2245 104 151 1
## 4 2013 7 1 43 2130 193 322 14
## 5 2013 7 1 44 2150 174 300 100
## 6 2013 7 1 46 2051 235 304 2358
## 7 2013 7 1 48 2001 287 308 2305
## 8 2013 7 1 58 2155 183 335 43
## 9 2013 7 1 100 2146 194 327 30
## 10 2013 7 1 100 2245 135 337 135
## # ℹ 86,316 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
#Arrived more than two hours late, but didn’t leave late
filter(flights, arr_delay > 120, dep_delay <= 0)
## # A tibble: 29 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 27 1419 1420 -1 1754 1550
## 2 2013 10 7 1350 1350 0 1736 1526
## 3 2013 10 7 1357 1359 -2 1858 1654
## 4 2013 10 16 657 700 -3 1258 1056
## 5 2013 11 1 658 700 -2 1329 1015
## 6 2013 3 18 1844 1847 -3 39 2219
## 7 2013 4 17 1635 1640 -5 2049 1845
## 8 2013 4 18 558 600 -2 1149 850
## 9 2013 4 18 655 700 -5 1213 950
## 10 2013 5 22 1827 1830 -3 2217 2010
## # ℹ 19 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
#Were delayed by at least an hour, but made up over 30 minutes in flight
filter(flights, dep_delay >= 60, dep_delay - arr_delay > 30)
## # A tibble: 1,844 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 2205 1720 285 46 2040
## 2 2013 1 1 2326 2130 116 131 18
## 3 2013 1 3 1503 1221 162 1803 1555
## 4 2013 1 3 1839 1700 99 2056 1950
## 5 2013 1 3 1850 1745 65 2148 2120
## 6 2013 1 3 1941 1759 102 2246 2139
## 7 2013 1 3 1950 1845 65 2228 2227
## 8 2013 1 3 2015 1915 60 2135 2111
## 9 2013 1 3 2257 2000 177 45 2224
## 10 2013 1 4 1917 1700 137 2135 1950
## # ℹ 1,834 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
-2, Sort flights to find the flights with longest departure delays.
Find the flights that left earliest in the morning.
arrange(flights, desc(dep_delay))
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 9 641 900 1301 1242 1530
## 2 2013 6 15 1432 1935 1137 1607 2120
## 3 2013 1 10 1121 1635 1126 1239 1810
## 4 2013 9 20 1139 1845 1014 1457 2210
## 5 2013 7 22 845 1600 1005 1044 1815
## 6 2013 4 10 1100 1900 960 1342 2211
## 7 2013 3 17 2321 810 911 135 1020
## 8 2013 6 27 959 1900 899 1236 2226
## 9 2013 7 22 2257 759 898 121 1026
## 10 2013 12 5 756 1700 896 1058 2020
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
arrange(flights, dep_delay)
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 12 7 2040 2123 -43 40 2352
## 2 2013 2 3 2022 2055 -33 2240 2338
## 3 2013 11 10 1408 1440 -32 1549 1559
## 4 2013 1 11 1900 1930 -30 2233 2243
## 5 2013 1 29 1703 1730 -27 1947 1957
## 6 2013 8 9 729 755 -26 1002 955
## 7 2013 10 23 1907 1932 -25 2143 2143
## 8 2013 3 30 2030 2055 -25 2213 2250
## 9 2013 3 2 1431 1455 -24 1601 1631
## 10 2013 5 5 934 958 -24 1225 1309
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
#The most delayed flight was HA 51 (JFK to HNL), which departed 1,301 minutes late. Sorted by dep_delay ascending, B6 97 (JFK to DEN) left 43 minutes ahead of schedule; to find the flights that left earliest in the morning by clock time, sort by dep_time instead (arrange(flights, dep_time)).
-3, Sort flights to find the fastest flights. (Hint: Try including a
math calculation inside of your function.)
head(arrange(flights, desc(distance / air_time)))
## # A tibble: 6 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 5 25 1709 1700 9 1923 1937
## 2 2013 7 2 1558 1513 45 1745 1719
## 3 2013 5 13 2040 2025 15 2225 2226
## 4 2013 3 23 1914 1910 4 2045 2043
## 5 2013 1 12 1559 1600 -1 1849 1917
## 6 2013 11 17 650 655 -5 1059 1150
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
#DL 1499 is the fastest flight in terms of speed (distance / air_time).
-4,Was there a flight on every day of 2013?
flights %>%
filter(year == 2013) %>%
distinct(month, day)
## # A tibble: 365 × 2
## month day
## <int> <int>
## 1 1 1
## 2 1 2
## 3 1 3
## 4 1 4
## 5 1 5
## 6 1 6
## 7 1 7
## 8 1 8
## 9 1 9
## 10 1 10
## # ℹ 355 more rows
#Yes: there are 365 distinct (month, day) combinations, so every day of 2013 had at least one flight.
-5,Which flights traveled the farthest distance? Which traveled the
least distance?
flights %>%
arrange(desc(distance))
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 857 900 -3 1516 1530
## 2 2013 1 2 909 900 9 1525 1530
## 3 2013 1 3 914 900 14 1504 1530
## 4 2013 1 4 900 900 0 1516 1530
## 5 2013 1 5 858 900 -2 1519 1530
## 6 2013 1 6 1019 900 79 1558 1530
## 7 2013 1 7 1042 900 102 1620 1530
## 8 2013 1 8 901 900 1 1504 1530
## 9 2013 1 9 641 900 1301 1242 1530
## 10 2013 1 10 859 900 -1 1449 1530
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>%
arrange(distance)
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 7 27 NA 106 NA NA 245
## 2 2013 1 3 2127 2129 -2 2222 2224
## 3 2013 1 4 1240 1200 40 1333 1306
## 4 2013 1 4 1829 1615 134 1937 1721
## 5 2013 1 4 2128 2129 -1 2218 2224
## 6 2013 1 5 1155 1200 -5 1241 1306
## 7 2013 1 6 2125 2129 -4 2224 2224
## 8 2013 1 7 2124 2129 -5 2212 2224
## 9 2013 1 8 2127 2130 -3 2304 2225
## 10 2013 1 9 2126 2129 -3 2217 2224
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
#HA 51 (JFK to HNL) traveled the farthest distance, and US 1632 (EWR to LGA) the shortest.
-6,Does it matter what order you used filter() and arrange() if
you’re using both? Why/why not? Think about the results and how much
work the functions would have to do.
#The result is the same either way, but the amount of work differs: filtering first is faster, because arrange() then only has to sort the rows that remain, whereas sorting first wastes effort ordering rows that filter() will discard.
4.3.5 Exercise
-1, Compare dep_time, sched_dep_time, and dep_delay. How would you
expect those three numbers to be related?
#I would expect dep_delay = dep_time - sched_dep_time, keeping in mind that the times are stored as HHMM integers, so the subtraction has to be done in minutes.
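One way to check this expectation is a sketch like the one below, which converts the HHMM-encoded times to minutes since midnight before subtracting. (Flights that depart on the day after their scheduled date will still not match, since no date is involved in the subtraction.)

```r
# Sketch: convert HHMM-encoded times to minutes since midnight, then
# compare the computed difference against dep_delay.
library(dplyr)
library(nycflights13)

flights |>
  mutate(
    dep_min = dep_time %/% 100 * 60 + dep_time %% 100,
    sched_min = sched_dep_time %/% 100 * 60 + sched_dep_time %% 100,
    computed_delay = dep_min - sched_min
  ) |>
  select(dep_time, sched_dep_time, dep_delay, computed_delay)
```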
-2, Brainstorm as many ways as possible to select dep_time,
dep_delay, arr_time, and arr_delay from flights.
flights %>%
select(dep_time, dep_delay, arr_time, arr_delay)
## # A tibble: 336,776 × 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # ℹ 336,766 more rows
flights %>%
select(starts_with("dep"), starts_with("arr"))
## # A tibble: 336,776 × 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # ℹ 336,766 more rows
flights %>%
select(c(dep_time, dep_delay, arr_time, arr_delay))
## # A tibble: 336,776 × 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # ℹ 336,766 more rows
-3, What happens if you specify the name of the same variable
multiple times in a select() call?
flights %>%
select(dep_time, dep_time, dep_delay, arr_time, arr_delay)
## # A tibble: 336,776 × 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # ℹ 336,766 more rows
#The duplicated variable is ignored: it appears only once in the output.
-4, What does the any_of() function do? Why might it be helpful in
conjunction with this vector?
#any_of() selects every column named in a character vector that exists in the data, silently skipping any that are missing (unlike all_of(), which errors). That makes it safe to use with a pre-defined vector of column names, as follows:
variables <- c("year", "month", "day", "dep_delay", "arr_delay")
flights %>%
select(any_of(variables))
## # A tibble: 336,776 × 5
## year month day dep_delay arr_delay
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 2 11
## 2 2013 1 1 4 20
## 3 2013 1 1 2 33
## 4 2013 1 1 -1 -18
## 5 2013 1 1 -6 -25
## 6 2013 1 1 -4 12
## 7 2013 1 1 -5 19
## 8 2013 1 1 -3 -14
## 9 2013 1 1 -3 -8
## 10 2013 1 1 -2 8
## # ℹ 336,766 more rows
-5, Does the result of running the following code surprise you? How
do the select helpers deal with upper and lower case by default? How can
you change that default?
flights |> select(contains("TIME"))
#Every column whose name contains "time" is returned, because the select helpers ignore case by default. The default can be changed with ignore.case = FALSE:
select(flights, contains("TIME", ignore.case = FALSE))
## # A tibble: 336,776 × 0
-6,Rename air_time to air_time_min to indicate units of measurement
and move it to the beginning of the data frame.
flights |>
rename(air_time_min = air_time)
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time_min <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights |>
relocate(air_time)
## # A tibble: 336,776 × 19
## air_time year month day dep_time sched_dep_time dep_delay arr_time
## <dbl> <int> <int> <int> <int> <int> <dbl> <int>
## 1 227 2013 1 1 517 515 2 830
## 2 227 2013 1 1 533 529 4 850
## 3 160 2013 1 1 542 540 2 923
## 4 183 2013 1 1 544 545 -1 1004
## 5 116 2013 1 1 554 600 -6 812
## 6 150 2013 1 1 554 558 -4 740
## 7 158 2013 1 1 555 600 -5 913
## 8 53 2013 1 1 557 600 -3 709
## 9 140 2013 1 1 557 600 -3 838
## 10 138 2013 1 1 558 600 -2 753
## # ℹ 336,766 more rows
## # ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
-7, Why doesn’t the following work, and what does the error mean?
flights |> select(tailnum) |> arrange(arr_delay)
## Error in `arrange()`:
## ℹ In argument: `..1 = arr_delay`.
## Caused by error:
## ! object 'arr_delay' not found
flights |>
select(tailnum, arr_delay) |>
arrange(arr_delay)
## # A tibble: 336,776 × 2
## tailnum arr_delay
## <chr> <dbl>
## 1 N843VA -86
## 2 N840VA -79
## 3 N851UA -75
## 4 N3KCAA -75
## 5 N551AS -74
## 6 N24212 -73
## 7 N3760C -71
## 8 N806UA -71
## 9 N805JB -71
## 10 N855VA -70
## # ℹ 336,766 more rows
#The select() call keeps only tailnum and drops arr_delay, so arrange() cannot find that column; the fix is to keep arr_delay in the selection, as above.
4.5.7 Exercise
-1, Which carrier has the worst average delays? Challenge: can you
disentangle the effects of bad airports vs. bad carriers? Why/why not?
(Hint: think about flights |> group_by(carrier, dest) |>
summarize(n()))
flights %>%
group_by(carrier) %>%
summarise(arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
arrange(desc(arr_delay))
## # A tibble: 16 × 2
## carrier arr_delay
## <chr> <dbl>
## 1 F9 21.9
## 2 FL 20.1
## 3 EV 15.8
## 4 YV 15.6
## 5 OO 11.9
## 6 MQ 10.8
## 7 WN 9.65
## 8 B6 9.46
## 9 9E 7.38
## 10 UA 3.56
## 11 US 2.13
## 12 VX 1.76
## 13 DL 1.64
## 14 AA 0.364
## 15 HA -6.92
## 16 AS -9.93
#F9 (Frontier Airlines) has the worst average arrival delay. Fully disentangling bad airports from bad carriers is difficult, because carriers fly different mixes of routes; grouping by both carrier and dest would compare carriers within each destination, but many carrier-destination pairs have too few flights for a reliable comparison.
-2, Find the flights that are most delayed upon departure from each
destination.
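One possible approach, sketched below with slice_max(), which keeps the row(s) with the largest value of a variable within each group (the na_rm argument, available in recent dplyr versions, drops missing delays):

```r
# Sketch: for each destination, keep the flight(s) with the largest
# departure delay, then sort the results by delay.
library(dplyr)
library(nycflights13)

flights |>
  group_by(dest) |>
  slice_max(dep_delay, n = 1, na_rm = TRUE) |>
  relocate(dest, dep_delay) |>
  arrange(desc(dep_delay))
```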
-3, How do delays vary over the course of the day. Illustrate your
answer with a plot.
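One possible approach, assuming "over the course of the day" means by scheduled hour of departure (flights already contains an hour column):

```r
# Sketch: average departure delay for each scheduled departure hour,
# drawn as a line so the daily trend is visible.
library(dplyr)
library(ggplot2)
library(nycflights13)

flights |>
  group_by(hour) |>
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) |>
  ggplot(aes(x = hour, y = avg_dep_delay)) +
  geom_line() +
  geom_point()
```

Average delays tend to grow over the course of the day, as late flights push back the departures that follow them.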
-4, What happens if you supply a negative n to slice_min() and
friends?
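A small check of the documented behavior, using a made-up tibble: a negative n is subtracted from the group size, so instead of keeping |n| rows, slice_min() and friends keep all but |n| rows.

```r
# Sketch: with 5 rows, n = -2 keeps 5 - 2 = 3 rows.
library(dplyr)

df_small <- tibble(x = c(5, 1, 4, 2, 3))

slice_min(df_small, x, n = 2)   # keeps the 2 smallest values of x
slice_min(df_small, x, n = -2)  # keeps all but 2 rows, still ordered by x
```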
-5, Explain what count() does in terms of the dplyr verbs you just
learned. What does the sort argument to count() do?
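count() is roughly shorthand for a group_by()/summarize(n = n()) pipeline, and sort = TRUE additionally arranges the result in descending order of n — a sketch of the equivalence:

```r
# Sketch: these two pipelines produce the same counts.
library(dplyr)
library(nycflights13)

flights |> count(dest, sort = TRUE)

flights |>
  group_by(dest) |>
  summarize(n = n()) |>
  arrange(desc(n))
```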
-6, Suppose we have the following tiny data frame:
df <- tibble(
  x = 1:5,
  y = c("a", "b", "a", "a", "b"),
  z = c("K", "K", "L", "L", "K")
)
Write down what you think the output will look like, then check if
you were correct, and describe what group_by() does.
df |> group_by(y)
Write down what you think the output will look like, then check if
you were correct, and describe what arrange() does. Also comment on how
it’s different from the group_by() in part (a)?
df |> arrange(y)
Write down what you think the output will look like, then check if
you were correct, and describe what the pipeline does.
df |> group_by(y) |> summarize(mean_x = mean(x))
Write down what you think the output will look like, then check if
you were correct, and describe what the pipeline does. Then, comment on
what the message says.
df |> group_by(y, z) |> summarize(mean_x = mean(x))
Write down what you think the output will look like, then check if
you were correct, and describe what the pipeline does. How is the output
different from the one in part (d).
df |> group_by(y, z) |> summarize(mean_x = mean(x), .groups = "drop")
Write down what you think the outputs will look like, then check if
you were correct, and describe what each pipeline does. How are the
outputs of the two pipelines different?
df |> group_by(y, z) |> summarize(mean_x = mean(x))
df |> group_by(y, z) |> mutate(mean_x = mean(x))
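A sketch of the difference between the last two pipelines: summarize() collapses each (y, z) group to a single row, while mutate() keeps every row and attaches the group mean to each of them.

```r
# Sketch: same grouping, different output shapes.
library(dplyr)

df <- tibble(
  x = 1:5,
  y = c("a", "b", "a", "a", "b"),
  z = c("K", "K", "L", "L", "K")
)

df |> group_by(y, z) |> summarize(mean_x = mean(x))  # 3 rows: one per group
df |> group_by(y, z) |> mutate(mean_x = mean(x))     # 5 rows: original shape
```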
5.6 Exercise
-1,Restyle the following pipelines following the guidelines
above.
flights|>filter(dest=="IAH")|>group_by(year,month,day)|>summarize(n=n(),
delay=mean(arr_delay,na.rm=TRUE))|>filter(n>10)
flights|>filter(carrier=="UA",dest%in%c("IAH","HOU"),sched_dep_time>
0900,sched_arr_time<2000)|>group_by(flight)|>summarize(delay=mean(
arr_delay,na.rm=TRUE),cancelled=sum(is.na(arr_delay)),n=n())|>filter(n>10)
#Repair
flights |>
  filter(dest == "IAH") |>
  group_by(year, month, day) |>
  summarize(
    n = n(),
    delay = mean(arr_delay, na.rm = TRUE)
  ) |>
  filter(n > 10)
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
## # A tibble: 365 × 5
## # Groups: year, month [12]
## year month day n delay
## <int> <int> <int> <int> <dbl>
## 1 2013 1 1 20 17.8
## 2 2013 1 2 20 7
## 3 2013 1 3 19 18.3
## 4 2013 1 4 20 -3.2
## 5 2013 1 5 13 20.2
## 6 2013 1 6 18 9.28
## 7 2013 1 7 19 -7.74
## 8 2013 1 8 19 7.79
## 9 2013 1 9 19 18.1
## 10 2013 1 10 19 6.68
## # ℹ 355 more rows
flights |>
  filter(
    carrier == "UA",
    dest %in% c("IAH", "HOU"),
    sched_dep_time > 0900,
    sched_arr_time < 2000
  ) |>
  group_by(flight) |>
  summarize(
    delay = mean(arr_delay, na.rm = TRUE),
    cancelled = sum(is.na(arr_delay)),
    n = n()
  ) |>
  filter(n > 10)
## # A tibble: 74 × 4
## flight delay cancelled n
## <int> <dbl> <int> <int>
## 1 53 12.5 2 18
## 2 112 14.1 0 14
## 3 205 -1.71 0 14
## 4 235 -5.36 0 14
## 5 255 -9.47 0 15
## 6 268 38.6 1 15
## 7 292 6.57 0 21
## 8 318 10.7 1 20
## 9 337 20.1 2 21
## 10 370 17.5 0 11
## # ℹ 64 more rows
6.2.1
-1, For each of the sample tables, describe what each observation and
each column represents.
#Table 1 has columns country, year, cases, and population; each row is one country-year observation, and each cell holds the value of one variable for that observation.
#Table 2 has columns country, year, type, and count; type is a character variable indicating whether count records cases or population.
#Table 3 has columns country, year, and rate; rate stores the cases and population from table 1 combined into a single character value ("cases/population").
#Table 4 is split across two tables: the first gives cases with one column per year, while the second gives population in the same layout.
-2, Sketch out the process you’d use to calculate the rate for table2
and table3. You will need to perform four operations:
Extract the number of TB cases per country per year. Extract the
matching population per country per year. Divide cases by population,
and multiply by 10000. Store back in the appropriate place. You haven’t
yet learned all the functions you’d need to actually perform these
operations, but you should still be able to think through the
transformations you’d need.
#Extract the number of TB cases per country per year.
table2_cases <- table2 %>%
filter(type == "cases")
#Extract the matching population per country per year
table2_pop <- table2 %>%
filter(type == "population")
#Divide cases by population, and multiply by 10000
table2_com <- tibble(
country = table2_cases$country,
year = table2_cases$year,
cases = table2_cases$count,
population = table2_pop$count
)
#Store back in the appropriate place.
table2_com <- table2_com %>%
mutate(rate = (cases / population) * 10000)
-Notes for new verbs:
-pivot_longer(): lengthens data, moving values out of column names into rows.
-parse_number(): drops non-numeric characters, keeping only the number in a string.
-names_sep =: splits column names on a separator character.
-distinct(): keeps only unique rows.
-pivot_wider(): widens data, moving values from rows into columns; names_from = picks the column that supplies the new column names, and id_cols = starts_with("...") selects the columns that identify each row.
-Difference between tibble() and tribble()? tibble() builds a data frame column by column, while tribble() (transposed tibble) lets you write it row by row under ~name headers.
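To answer that note: tibble() builds a data frame column by column, while tribble() (transposed tibble) writes the same data row by row under ~name headers. A minimal sketch:

```r
library(tibble)

# tibble(): supply each column as a vector
t1 <- tibble(
  x = c("a", "b"),
  y = c(1, 2)
)

# tribble(): write the column headers with ~, then one row per line
t2 <- tribble(
  ~x, ~y,
  "a", 1,
  "b", 2
)

identical(t1, t2) # TRUE: both calls build the same tibble
```

tribble() is easier to read for small hand-written lookup tables; tibble() is more natural when the columns already exist as vectors.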
8.2.4
-What function would you use to read a file where fields were
separated with “|”?
#We could use read_delim() with delim = "|" to read a file whose fields are separated by "|".
-Apart from file, skip, and comment, what other arguments do
read_csv() and read_tsv() have in common?
#Common: col_names, col_types, col_select, id, locale, na, trim_ws, quoted_na, quote, comment, skip, n_max, guess_max, progress, name_repair, num_threads, show_col_types, skip_empty_rows, lazy
-What are the most important arguments to read_fwf()?
#The most important argument is col_positions, built with fwf_widths() or fwf_positions() (or guessed with fwf_empty()), since it defines where each field starts and ends.
-Sometimes strings in a CSV file contain commas. To prevent them from
causing problems, they need to be surrounded by a quoting character,
like " or '. By default, read_csv() assumes that the quoting character
will be ". To read the following text into a data frame, what argument
to read_csv() do you need to specify?
"x,y\n1,'a,b'"
#Specify the quoting character: quote = "'"
read_csv("x,y\n1,'a,b'", quote = "'")
## Rows: 1 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): y
## dbl (1): x
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 1 × 2
## x y
## <dbl> <chr>
## 1 1 a,b
-Identify what is wrong with each of the following inline CSV files.
What happens when you run the code?
read_csv("a,b\n1,2,3\n4,5,6") read_csv("a,b,c\n1,2\n1,2,3,4") read_csv("a,b\n\"1")
read_csv("a,b\n1,2\na,b") read_csv("a;b\n1;3")
read_csv("a,b\n1,2,3\n4,5,6")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 2 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (1): a
## num (1): b
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 2 × 2
## a b
## <dbl> <dbl>
## 1 1 23
## 2 4 56
#Parsing issues: the header declares two columns, but each data row has three values, so the extra values cannot be matched to a column.
read_csv("a,b,c\n1,2\n1,2,3,4")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 2 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): a, b
## num (1): c
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 2 × 3
## a b c
## <dbl> <dbl> <dbl>
## 1 1 2 NA
## 2 1 2 34
#The header declares three columns, but the first data row has only two values (so c becomes NA) and the second has four, which does not match the number of columns.
read_csv("a,b\n\"1")
## Rows: 0 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): a, b
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 0 × 2
## # ℹ 2 variables: a <chr>, b <chr>
#The dataset declares two header columns but gives only one value in the first row, and its opening quote (\") is never closed, so no complete row is read.
read_csv("a,b\n1,2\na,b")
## Rows: 2 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): a, b
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 2 × 2
## a b
## <chr> <chr>
## 1 1 2
## 2 a b
#Nothing fails here, but the last row repeats the header values, so both columns are parsed as character.
read_csv("a;b\n1;3")
## Rows: 1 Columns: 1
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): a;b
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 1 × 1
## `a;b`
## <chr>
## 1 1;3
#For datasets that use ";" as the delimiter, use read_csv2() (or read_delim(delim = ";")).
-Practice referring to non-syntactic names in the following data
frame by:
Extracting the variable called 1. Plotting a scatterplot of 1 vs. 2.
Creating a new column called 3, which is 2 divided by 1. Renaming the
columns to one, two, and three.
annoying <- tibble( `1` = 1:10, `2` = `1` * 2 + rnorm(length(`1`)) )
annoying <- tibble(
`1` = 1:10,
`2` = `1` * 2 + rnorm(length(`1`))
)
annoying %>%
select(`1`, `2`)
## # A tibble: 10 × 2
## `1` `2`
## <int> <dbl>
## 1 1 4.43
## 2 2 2.97
## 3 3 7.25
## 4 4 9.31
## 5 5 11.5
## 6 6 10.6
## 7 7 13.5
## 8 8 15.3
## 9 9 20.1
## 10 10 20.0
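The remaining steps of the exercise (plotting 1 vs. 2, creating 3, and renaming) can be sketched like this, assuming the same annoying tibble:

```r
library(tidyverse)

annoying <- tibble(
  `1` = 1:10,
  `2` = `1` * 2 + rnorm(length(`1`))
)

# Extract the variable called 1 (backticks refer to the non-syntactic name)
annoying$`1`

# Scatterplot of 1 vs. 2
ggplot(annoying, aes(x = `1`, y = `2`)) +
  geom_point()

# Create column 3 as 2 divided by 1, then rename the columns
annoying |>
  mutate(`3` = `2` / `1`) |>
  rename(one = `1`, two = `2`, three = `3`)
```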
9 Workflow: getting help
9.1 Google is your friend
- It is helpful to use Google and Stack Overflow to solve code
errors.
9.2 Making a reprex
- Use reprex() to make code reproducible and to format its output for
pasting into GitHub. Use dput() to generate the R code needed to recreate a data object.
10 Layers
10.2 Aesthetic mappings
ggplot2 will only use six shapes at a time; by default, additional groups go unplotted.
Mapping an unordered discrete (categorical) variable (class) to
an ordered aesthetic (size or alpha) is generally not a good idea
because it implies a ranking that does not in fact exist.
You can customize size, shape, and color by supplying numbers (for
size and shape) and color names (for color) outside of aes().
10.2.1 Exercises
1, Create a scatterplot of hwy vs. displ where the points are pink
filled in triangles.
library(tidyverse)
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(color = "pink", shape = 17)
2. Why did the following code not result in a plot with blue points?
- A: color = "blue" was placed inside aes(), where it is treated as a mapping to a constant, instead of being passed to geom_point() as a fixed aesthetic outside of aes().
- What does the stroke aesthetic do? What shapes does it work with?
(Hint: use ?geom_point)
- A: stroke modifies the width of the border, which lets you color the
inside and outside differently. It works with shapes that have a
border (shapes 21-25).
- What happens if you map an aesthetic to something other than a
variable name, like aes(color = displ < 5)? Note, you’ll also need to
specify x and y.
ggplot(mpg, aes(x = hwy, y = displ,color = displ < 5))+geom_point()

- A: aes(color = displ < 5) maps a logical variable, so ggplot2 picks two
default colors, one for points with displ < 5 and one for displ >= 5.
10.3 Geometric objects
-geom_smooth() fits a smooth line to the data; mapping a categorical
variable to linetype or group separates it into one line per category.
Other geoms include geom_histogram(), geom_density(), geom_boxplot(), and more (https://ggplot2.tidyverse.org/reference).
10.3.1 Exercises
- What geom would you use to draw a line chart? A boxplot? A
histogram? An area chart?
- A: geom_line() for a line chart, geom_boxplot() for a boxplot,
geom_histogram() for a histogram, and geom_area() for an area chart.
- Earlier in this chapter we used show.legend without explaining
it:
ggplot(mpg, aes(x = displ, y = hwy)) + geom_smooth(aes(color =
drv), show.legend = FALSE)
What does show.legend = FALSE do here? What happens if you remove it?
Why do you think we used it earlier?
- A: "show.legend = FALSE" removes the drv color legend on the right.
Removing the argument makes the legend reappear. I think it was used
earlier so that the plots shown side by side would all stay the same
width, with the colors explained in the surrounding text instead.
- What does the se argument to geom_smooth() do?
- A: se displays the confidence interval around the smooth line (TRUE
by default; see the level argument to control it).
- Recreate the R code necessary to generate the following graphs. Note
that wherever a categorical variable is used in the plot, it’s drv.
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(mpg, aes(x = displ, y = hwy)) +
geom_smooth(mapping = aes(group = drv), se = FALSE) +
geom_point()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(colour = drv)) +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(mpg, aes(x = displ, y = hwy, colour = drv, linetype = drv)) +
geom_point() +
geom_smooth( se = FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
geom_point(size = 4, color = "white") +
geom_point()

10.4 Facets
- facet_wrap() splits a plot into subplots based on one variable.
facet_grid() takes a double-sided formula: rows ~ cols. Adding
scales = "free" (or "free_x"/"free_y") lets the panel axes vary.
10.4.1 Exercises
1 What happens if you facet on a continuous variable? - The variable is
converted to a factor, so you get one subplot for each unique value,
which can be a very large number of panels.
- What do the empty cells in the plot above with facet_grid(drv ~ cyl)
mean? Run the following code. How do they relate to the resulting
plot?
ggplot(mpg) +
geom_point(aes(x = drv, y = cyl))
- Empty cells in the plot above stand for combinations of drv and cyl
with no observations in the dataset. The code in the question plots
every observed combination of drv and cyl; the combinations missing
from that plot are exactly the empty facets in the previous plot.
3.What plots does the following code make? What does . do?
ggplot(mpg) +
geom_point(aes(x = displ, y = hwy)) +
facet_grid(drv ~ .)

ggplot(mpg) +
geom_point(aes(x = displ, y = hwy)) +
facet_grid(. ~ cyl)
- The '.' keeps facet_grid() from splitting on that dimension. The
first plot facets by drv in rows, while the second facets by cyl in
columns.
4.Take the first faceted plot in this section:
ggplot(mpg) +
geom_point(aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
What are the advantages to using faceting instead of the color
aesthetic? What are the disadvantages? How might the balance change if
you had a larger dataset? - Advantage: each category gets its own
panel, so overlapping points don't obscure each other. Disadvantage:
the categories are harder to compare directly, since they no longer
share one panel. With a larger dataset, overplotting gets worse, so
faceting becomes more attractive.
5.Read ?facet_wrap. What does nrow do? What does ncol do? What other
options control the layout of the individual panels? Why doesn’t
facet_grid() have nrow and ncol arguments? - nrow and ncol determine
the number of rows and columns in facet_wrap(); dir, as.table, and
strip.position also control the panel layout. facet_grid() does not
need nrow and ncol, because its layout is fixed by the number of unique
values of the row and column faceting variables.
- Which of the following plots makes it easier to compare engine size
(displ) across cars with different drive trains? What does this say
about when to place a faceting variable across rows or columns?
ggplot(mpg, aes(x = displ)) +
geom_histogram() +
facet_grid(drv ~ .)

ggplot(mpg, aes(x = displ)) +
geom_histogram() +
facet_grid(. ~ drv)

Recreate the following plot using facet_wrap() instead of
facet_grid(). How do the positions of the facet labels change?
ggplot(mpg) +
geom_point(aes(x = displ, y = hwy)) +
facet_grid(drv ~ .)

ggplot(mpg) +
geom_point(aes(x = displ, y = hwy)) +
facet_wrap(drv ~ .)

- facet_grid(drv ~ .) stacks the histograms vertically on a shared
x-axis, which makes it easier to compare engine size (displ) across
drive trains; aligning the comparison on the same axis allows a quick
read. In general, facet across rows to compare along x and across
columns to compare along y.
- facet_wrap() moves the facet labels from the side to the top of each panel.
10.6 Position adjustments
- With x fixed, mapping another variable to fill stacks the counts for
each combination of that variable with x.
- Three other options: "identity", "dodge", and "fill".
- position = “identity” will place each object exactly where it falls
in the context of the graph. This is not very useful for bars, because
it overlaps them.
- Position = “fill” works like stacking, but makes each set of stacked
bars the same height. This makes it easier to compare proportions across
groups.
- Position = “dodge” places overlapping objects directly beside one
another. This makes it easier to compare individual values.
- Position = “jitter” adds a small amount of random noise to each
point.
10.6.1
1.What is the problem with the following plot? How could you improve
it?
ggplot(mpg, aes(x = cty, y = hwy)) +
geom_point()

ggplot(mpg, aes(x = cty, y = hwy)) +
geom_point(position = "jitter")

- Many cty/hwy combinations overlap exactly, so points hide one
another. Adding position = "jitter" adds noise to each point, making the
overlapping points visible.
2.What, if anything, is the difference between the two plots?
Why?
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point()

ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(position = "identity")
- There is no graphical difference between two plots, since position =
“identity” will place each object exactly where it falls in the context
of the graph, as default geom_point() represents.
What parameters to geom_jitter() control the amount of jittering?
-The width and height.
Compare and contrast geom_jitter() with geom_count(). -geom_count()
does not change the position of points; instead it counts the
observations at each location and maps that count to point size, while
geom_jitter() adds random variation to each point's position.
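The contrast can be sketched side by side; the width and height values below (0.2 data units) are arbitrary choices:

```r
library(tidyverse)

# geom_count(): points stay put; size encodes how many observations overlap
ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_count()

# geom_jitter(): each point is nudged by random noise of at most width/height
ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_jitter(width = 0.2, height = 0.2)
```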
What’s the default position adjustment for geom_boxplot()? Create
a visualization of the mpg dataset that demonstrates it.
ggplot(data = mpg, aes(x = hwy, y = displ, colour = class)) +
geom_boxplot()
- The default position adjustment is position_dodge2(), which places the
boxes of overlapping groups side by side; compare with
position = "identity", which would overlap them.
10.7 Coordinate systems
- coord_quickmap() sets the aspect ratio correctly for geographic
maps. coord_polar() uses polar coordinates.
10.7.1 Exercises
1.Turn a stacked bar chart into a pie chart using coord_polar().
ggplot(mpg, aes(x = factor(1), fill = class)) +
geom_bar()

ggplot(mpg, aes(x = factor(1), fill = class)) +
geom_bar(width = 1) +
coord_polar(theta = "y")

2.What’s the difference between coord_quickmap() and coord_map()? -?
coord_quickmap(). ?coord_map() - coord_quickmap() is a faster but more
approximate projection than coord_map(), which performs the full map
projection calculation.
3.What does the following plot tell you about the relationship
between city and highway mpg? Why is coord_fixed() important? What does
geom_abline() do?
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point() +
geom_abline() +
coord_fixed()

- The plot shows that hwy is consistently greater than cty, roughly
linearly. coord_fixed() keeps one unit on the x-axis the same length as
one unit on the y-axis, so the reference line sits at a true 45 degrees.
geom_abline() adds that reference line (slope 1, intercept 0 by
default), making it easy to compare highway and city mileage.
10.8 The layered grammar of graphics
- Build a graph by combining the layers of the grammar: data, aesthetic
mappings, geoms, stats, position adjustments, facets, and coordinate
systems.
11 Exploratory data analysis
11.1 Introduction
- Generate questions about your data; search for answers by
visualizing, transforming, and modelling it; use what you learn to
refine your questions and/or generate new ones.
11.2 Questions
Using visualization and grouping. How are the observations within
each subgroup similar to each other? How are the observations in
separate clusters different from each other? How can you explain or
describe the clusters? Why might the appearance of clusters be
misleading?
coord_cartesian() can be used to zoom in on the axes to inspect unusual values
11.3.3 Exercises
1.Explore the distribution of each of the x, y, and z variables in
diamonds. What do you learn? Think about a diamond and how you might
decide which dimension is the length, width, and depth.
#Distribution of x
ggplot(diamonds) +
geom_histogram(mapping = aes(x = x), binwidth = 0.01)

#Distribution of y
ggplot(diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.01)

#Distribution of z
ggplot(diamonds) +
geom_histogram(mapping = aes(x = z), binwidth = 0.01)
- All three variables are right skewed in general and have some
noticeable outliers. x and y cover similar ranges, so they are likely
the length and width, while z, which is smaller, is likely the depth.
2.Explore the distribution of price. Do you discover anything unusual
or surprising? (Hint: Carefully think about the binwidth and make sure
you try a wide range of values.)
ggplot(diamonds, aes(x = price)) +
geom_histogram(binwidth = 10, center = 0)
- The price distribution is right skewed, with a long tail of expensive
diamonds and many outliers. Surprisingly, with a small binwidth a gap
appears around $1500: there are almost no diamonds at that price.
3.How many diamonds are 0.99 carat? How many are 1 carat? What do you
think is the cause of the difference?
#Diamonds with 0.99 and 1 carat
diamonds %>%
filter(carat >= 0.99, carat <= 1) %>%
count(carat)
## # A tibble: 2 × 2
## carat n
## <dbl> <int>
## 1 0.99 23
## 2 1 1558
- There are 1558 1-carat diamonds but only 23 0.99-carat diamonds. The
difference is unlikely to be a natural phenomenon; diamonds just under a
whole carat are probably cut or reported up to the rounder, more
marketable 1-carat mark.
4.Compare and contrast coord_cartesian() vs. xlim() or ylim() when
zooming in on a histogram. What happens if you leave binwidth unset?
What happens if you try and zoom so only half a bar shows?
#Leave bin width unset
ggplot(diamonds) +
geom_histogram(mapping = aes(x = price)) +
coord_cartesian(xlim = c(114, 514), ylim = c(1000, 4000))

ggplot(diamonds) +
geom_histogram(mapping = aes(x = price)) +
xlim(114, 514) +
ylim(1000, 4000)
- coord_cartesian() zooms in on a region of the fully computed plot,
while xlim() and ylim() drop the data outside the limits before the
statistics are computed (the removed points become NA, with a warning).
With binwidth unset, geom_histogram() defaults to 30 bins over the
range of whatever data remains. If you zoom so only half a bar shows,
coord_cartesian() draws the clipped half, whereas xlim()/ylim() removes
the bar's underlying data entirely.
11.4 Unusual Values & Exercises
- Drop rows with strange values.
1.What happens to missing values in a histogram? What happens to
missing values in a bar chart? Why is there a difference in how missing
values are handled in histograms and bar charts?
- (Google)The missing values in a histogram would be removed. “In the
geom_bar() function, NA is treated as another category. The x aesthetic
in geom_bar() requires a discrete (categorical) variable, and missing
values act like another category. In a histogram, the x aesthetic
variable needs to be numeric, and stat_bin() groups the observations by
ranges into bins. Since the numeric value of the NA observations is
unknown, they cannot be placed in a particular bin, and are
dropped.”
2.What does na.rm = TRUE do in mean() and sum()?
- It removes NA values before calculating the mean or sum.
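A quick base-R illustration:

```r
x <- c(1, 2, NA, 4)

mean(x)               # NA: one missing value makes the whole mean unknown
mean(x, na.rm = TRUE) # 2.333...: the NA is dropped before averaging
sum(x, na.rm = TRUE)  # 7
```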
3.Recreate the frequency plot of scheduled_dep_time colored by
whether the flight was cancelled or not. Also facet by the cancelled
variable. Experiment with different values of the scales variable in the
faceting function to mitigate the effect of more non-cancelled flights
than cancelled flights.
nycflights13::flights |>
mutate(
cancelled = is.na(dep_time),
sched_hour = sched_dep_time %/% 100,
sched_min = sched_dep_time %% 100,
sched_dep_time = sched_hour + (sched_min / 60)
) |>
ggplot(aes(x = sched_dep_time)) +
geom_freqpoly(aes(color = cancelled), binwidth = 1/4)

11.5
- Covariation is the tendency for the values of two or more variables
to vary together in a related way.
11.5.1 A categorical and a numerical variable
- fct_reorder() reorders the levels of a factor (here by the median of
hwy) for a more informative display.
ggplot(mpg, aes(x = fct_reorder(class, hwy, median), y = hwy)) +
geom_boxplot()

11.5.1.1 Exercises
1.Use what you’ve learned to improve the visualization of the
departure times of cancelled vs. non-cancelled flights.
nycflights13::flights |>
mutate(cancelled = is.na(dep_time) | is.na(arr_time)) %>%
ggplot() +
geom_boxplot(aes(x = cancelled, y = dep_time))

2.Based on EDA, what variable in the diamonds dataset appears to be
most important for predicting the price of a diamond? How is that
variable correlated with cut? Why does the combination of those two
relationships lead to lower quality diamonds being more expensive?
#Correlation between carat and price
ggplot(diamonds) +
geom_point(aes(x = carat, y = price), color = "green", alpha = 0.76)

#Correlation between depth and price
ggplot(diamonds) +
geom_point(aes(x = depth, y = price), color = "green", alpha = 0.76)

#Correlation between z and price
ggplot(diamonds) +
geom_point(aes(x = z, y = price), color = "green", alpha = 0.76)

#Correlation between x and price
ggplot(diamonds) +
geom_point(aes(x = x, y = price), color = "green", alpha = 0.76)
- It appears that carat and x (which closely tracks carat) have the
highest correlation with price among these variables.
3.Instead of exchanging the x and y variables, add coord_flip() as a
new layer to the vertical boxplot to create a horizontal one. How does
this compare to exchanging the variables?
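A sketch for this exercise: the aes() mapping stays vertical, and coord_flip() swaps the axes at display time:

```r
library(tidyverse)

# Same vertical boxplot as before, flipped: class ends up on the y-axis,
# just as if x and y had been exchanged, but the mapping is unchanged
ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot() +
  coord_flip()
```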
4.One problem with boxplots is that they were developed in an era of
much smaller datasets and tend to display a prohibitively large number
of “outlying values”. One approach to remedy this problem is the letter
value plot. Install the lvplot package, and try using geom_lv() to
display the distribution of price vs. cut. What do you learn? How do you
interpret the plots?
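A sketch for this exercise, assuming the lvplot package is installed:

```r
library(tidyverse)
library(lvplot) # assumed installed; provides geom_lv(), the letter-value plot

# Letter-value boxes show deeper quantiles of price within each cut
# instead of flagging thousands of points as "outliers"
ggplot(diamonds, aes(x = cut, y = price)) +
  geom_lv()
```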
5.Create a visualization of diamond prices vs. a categorical variable
from the diamonds dataset using geom_violin(), then a faceted
geom_histogram(), then a colored geom_freqpoly(), and then a colored
geom_density(). Compare and contrast the four plots. What are the pros
and cons of each method of visualizing the distribution of a numerical
variable based on the levels of a categorical variable?
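The four plots for this exercise can be sketched as follows (the binwidth of 500 is an arbitrary choice):

```r
library(tidyverse)

# Violin: mirrored density of price within each cut
ggplot(diamonds, aes(x = cut, y = price)) +
  geom_violin()

# Faceted histogram: one panel of counts per cut
ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500) +
  facet_wrap(~cut)

# Colored frequency polygon: counts for all cuts overlaid
ggplot(diamonds, aes(x = price, color = cut)) +
  geom_freqpoly(binwidth = 500)

# Colored density: overlaid and normalized, so shapes are comparable
ggplot(diamonds, aes(x = price, color = cut)) +
  geom_density()
```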
6.If you have a small dataset, it’s sometimes useful to use
geom_jitter() to avoid overplotting to more easily see the relationship
between a continuous and categorical variable. The ggbeeswarm package
provides a number of methods similar to geom_jitter(). List them and
briefly describe what each one does.
11.5.2 Two categorical variables
- To create the plot with two categorical variables, use
geom_count()
- Use dplyr to computing the counts between these variables.
- Visualize with geom_tile() and the fill aesthetic.
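The recipe above can be sketched with two diamonds variables:

```r
library(tidyverse)

# Count every color/cut combination, then map the count to fill
diamonds |>
  count(color, cut) |>
  ggplot(aes(x = color, y = cut, fill = n)) +
  geom_tile()
```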
11.5.3 Two numerical variables
- Use geom_point() to plot two numerical variables; the alpha aesthetic
can add transparency to handle overplotting.
- New tools to bin in two dimensions: geom_bin2d() and geom_hex().
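For example (geom_hex() additionally requires the hexbin package):

```r
library(tidyverse)

# Rectangular 2d bins: fill encodes how many points fall in each bin
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_bin2d()

# Hexagonal bins, assuming the hexbin package is installed
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_hex()
```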
11.6 Patterns and models
- If a systematic relationship exists between two variables, it will
appear as a pattern in the data. When you spot a pattern, ask: could it
be due to coincidence? What relationship does the pattern imply? How
strong is the relationship? What other variables might affect it? Does
the relationship change if you look at individual subgroups of the
data?
12 Communication
12.1.1 Prerequisites
library(tidyverse)
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
library(ggrepel)
library(patchwork)
12.2 Labels
- labs() adds names to elements in the coordinate. Those elements
include x, y, color, title, subtitle, caption.
12.2.1 Exercises
1.Create one plot on the fuel economy data with customized title,
subtitle, caption, x, y, and color labels.
ggplot()+
geom_point(data = mpg, aes( x = hwy, y = displ, colour = drv, shape = drv))+
labs( x = "Engine displacement (L)",
y = "Highway fuel economy (mpg)",
title = "Large engine displacement results in lower gas mileage performance",
subtitle = "SUV and pickup classes have more small engine & high mpg combination",
caption = "Dataset from tidyverse")

2.Recreate the following plot using the fuel economy data. Note that
both the colors and shapes of points vary by type of drive train.
ggplot(mpg, aes(x = cty, y= hwy, shape = drv, color = drv))+
geom_point()+
labs( x = "City MPG",
y = "Highway MPG",
shape = "Type of drive train")
3.Take an exploratory graphic that you’ve created in the last month, and
add informative titles to make it easier for others to understand.
12.3 Annotations
12.3.1 Exercises
1.Use geom_text() with infinite positions to place text at the four
corners of the plot.
label <- tribble(
~displ, ~hwy, ~label, ~vjust, ~hjust,
Inf, Inf, "Top right", "top", "right",
Inf, -Inf, "Bottom right", "bottom", "right",
-Inf, Inf, "Top left", "top", "left",
-Inf, -Inf, "Bottom left", "bottom", "left"
)
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_text(aes(label = label, vjust = vjust, hjust = hjust), data = label)

- Use annotate() to add a point geom in the middle of your last plot
without having to create a tibble. Customize the shape, size, or color
of the point.
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
annotate(geom = "label", x= max(mpg$displ), y= max(mpg$hwy),
label = "Top right", vjust = "top",
hjust = "right", color = "red"
) +
annotate(geom = "label", x= min(mpg$displ), y= max(mpg$hwy),
label = "Top left", vjust = "top",
hjust = "left", color = "red"
) +
annotate(geom = "label", x= max(mpg$displ), y= min(mpg$hwy),
label = "Bottom right", vjust = "bottom",
hjust = "right", color = "red"
) +
annotate(geom = "label", x= min(mpg$displ), y= min(mpg$hwy),
label = "Bottom left", vjust = "bottom",
hjust = "left", color = "red"
)

- How do labels with geom_text() interact with faceting? How can you
add a label to a single facet? How can you put a different label in each
facet? (Hint: Think about the dataset that is being passed to
geom_text().)
# labels in each different plots
label <- tibble(
displ = Inf,
hwy = Inf,
class = unique(mpg$class),
label = str_c("Label for ", class)
)
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_text(aes(label = label),
data = label, vjust = "top", hjust = "right",
size = 3
) +
facet_wrap(~class)

- What arguments to geom_label() control the appearance of the
background box?
-label.padding: amount of padding around the label
-label.r: radius of the rounded corners
-label.size: width of the label border
- What are the four arguments to arrow()? How do they work? Create a
series of plots that demonstrate the most important options.
-angle: angle of the arrow head
-length: length of the arrow head
-ends: which end(s) of the line to draw the arrow head on
-type: "open" or "closed": whether the arrow head is a closed or open triangle
12.4 Scales
12.4.1 Default scales
- scale_ followed by the name of the aesthetic, then _, then the name
of the scale. The default scales are named according to the type of
variable they align with: continuous, discrete, datetime, or date.
scale_x_continuous() puts the numeric values from displ on a continuous
number line on the x-axis, scale_color_discrete() chooses colors for
each of the class of car, etc.
12.4.2 Axis ticks and legend keys
-There are two primary arguments that affect the appearance of the
ticks on the axes and the keys on the legend: breaks and labels. Breaks
controls the position of the ticks, or the values associated with the
keys. Labels controls the text label associated with each tick/key.
-We can use labels in the same way: label_dollar() adds a dollar sign,
and label_percent() formats numbers as percentages.
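For example, with the scales package loaded:

```r
library(tidyverse)
library(scales)

# Format the y-axis ticks as dollars; label_percent() works the same way
ggplot(diamonds, aes(x = cut, y = price)) +
  geom_boxplot() +
  scale_y_continuous(labels = label_dollar())
```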
12.4.3 Legend layout
-To control the overall position of the legend, you need to use a
theme() setting.The theme setting legend.position controls where the
legend is drawn.
-To control the display of individual legends, use guides() along
with guide_legend() or guide_colorbar(). Note that the name of the
argument in guides() matches the name of the aesthetic, just like in
labs().
12.4.4 Replacing a scale
-It’s very useful to plot transformations of your variable. The
ColorBrewer scales are documented online at https://colorbrewer2.org/ and made available in R via
the RColorBrewer package, by Erich Neuwirth.
-For continuous color, you can use the built-in
scale_color_gradient() or scale_fill_gradient(). If you have a diverging
scale, you can use scale_color_gradient2().
-Note that all color scales come in two varieties: scale_color_()
and scale_fill_() for the color and fill aesthetics respectively
(the color scales are available in both UK and US spellings).
12.4.5 Zooming
-There are three ways to control the plot limits:
-Adjusting what data are plotted. -Setting the limits in each scale.
-Setting xlim and ylim in coord_cartesian().
-To zoom in on a region of the plot, it’s generally best to use
coord_cartesian().
-Setting the limits on individual scales is generally more useful if
you want to expand the limits, e.g., to match scales across different
plots.
12.4.6 Exercises
- Why doesn’t the following code override the default scale?
df <- tibble(
x = rnorm(10000),
y = rnorm(10000)
)
ggplot(df, aes(x, y)) +
geom_hex() +
scale_color_gradient(low = "white", high = "red") +
coord_fixed()
-Because the colors in geom_hex() are set by the fill aesthetic, not the
color aesthetic.
- What is the first argument to every scale? How does it compare to
labs()?
-The first argument to every scale is name, the axis or legend label,
so setting it is equivalent to using labs().
- Change the display of the presidential terms by:
a.Combining the two variants that customize colors and x axis breaks.
b.Improving the display of the y axis. c.Labelling each term with the
name of the president. d.Adding informative plot labels. e.Placing
breaks every 4 years (this is trickier than it seems!).
fouryears <- lubridate::make_date(seq(year(min(presidential$start)),
year(max(presidential$end)),
by = 4
), 1, 1)
presidential %>%
mutate(
id = 33 + row_number(),
name_id = fct_inorder(str_c(name, " (", id, ")"))
) %>%
ggplot(aes(start, name_id, colour = party)) +
geom_point() +
geom_segment(aes(xend = end, yend = name_id)) +
scale_colour_manual("Party", values = c(Republican = "red", Democratic = "blue")) +
scale_y_discrete(NULL) +
scale_x_date(NULL,
breaks = presidential$start, date_labels = "'%y",
minor_breaks = fouryears
) +
ggtitle("Terms of US Presidents",
subtitle = "Eisenhower (34th) to Obama (44th)"
) +
theme(
panel.grid.minor = element_blank(),
axis.ticks.y = element_blank()
)
4.First, create the following plot. Then, modify the code using
override.aes to make the legend easier to see.
ggplot(diamonds, aes(x = carat, y = price)) +
geom_point(aes(color = cut), alpha = 1/20)+
theme(legend.position = "bottom")+
guides(color = guide_legend(nrow=2, override.aes = list(alpha = 1)))

12.5 Themes
12.5.1 Exercises
- Pick a theme offered by the ggthemes package and apply it to the
last plot you made.
ggplot(diamonds, aes(x = carat, y = price)) +
geom_point(aes(color = cut), alpha = 1/20)+
theme(legend.position = c(0.6, 0.7),
legend.direction = "horizontal",
legend.box.background = element_rect(color = "black"),
plot.title = element_text(face = "bold"),
plot.title.position = "plot",
plot.caption.position = "plot",
plot.caption = element_text(hjust = 0))+
guides(color = guide_legend(nrow=2, override.aes = list(alpha = 1)))

12.6 Layout
12.6.1 Exercises
- What happens if you omit the parentheses in the following plot
layout. Can you explain why this happens?
p1 <- ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
labs(title = "Plot 1")
p2 <- ggplot(mpg, aes(x = drv, y = hwy)) +
geom_boxplot() +
labs(title = "Plot 2")
p3 <- ggplot(mpg, aes(x = cty, y = hwy)) +
geom_point() +
labs(title = "Plot 3")
(p1 | p2) / p3

-Omitting the parentheses changes the layout: because / binds more
tightly than |, p1 | p2 / p3 places p1 beside a column in which p2 is
stacked over p3. The parentheses control which plots are grouped
together in the layout.
- Using the three plots from the previous exercise, recreate the
following patchwork.
Three plots: Plot 1 is a scatterplot of highway mileage versus engine
size. Plot 2 is side-by-side box plots of highway mileage versus drive
train. Plot 3 is side-by-side box plots of city mileage versus drive
train. Plot 1 is on the first row. Plots 2 and 3 are on the next row,
each spanning half the width of Plot 1. Plot 1 is labelled “Fig. A”, Plot 2
is labelled “Fig. B”, and Plot 3 is labelled “Fig. C”.
po1 <- ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
labs(title = "Plot 1")
po2 <- ggplot(mpg, aes(x = drv, y = hwy)) +
geom_boxplot() +
labs(title = "Plot 2")
po3 <- ggplot(mpg, aes(x = drv, y = cty)) +
geom_boxplot() +
labs(title = "Plot 3")
po1 / (po2 | po3) + plot_annotation(tag_levels = "A", tag_prefix = "Fig. ")

13 Logical vectors
13.1 Introduction
13.1.1 Prerequisites
library(tidyverse)
library(nycflights13)
#To transform a variable inside a data frame, use mutate()
13.2 Comparison
-We use filter() with the comparison operators <, <=, >, >=, ==, and != to make comparisons.
-Use dplyr::near() rather than == to compare floating point numbers, since they are stored with finite precision.
- How does dplyr::near() work? Type near to see the source code. Is
sqrt(2)^2 near 2?
near(sqrt(2)^2, 2)
## [1] TRUE
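The reason == fails here, and near() is needed, can be seen with base R alone:

```r
# Floating point arithmetic is inexact, so == can fail on
# mathematically equal quantities:
x <- sqrt(2)^2
x == 2             # FALSE: x carries a tiny representation error
abs(x - 2) < 1e-8  # TRUE: comparing within a tolerance, which is
                   # what dplyr::near() does with a sensible default
```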
- Use mutate(), is.na(), and count() together to describe how the
missing values in dep_time, sched_dep_time and dep_delay are
connected.
flights |>
mutate(dep_time_na = is.na(dep_time),
sched_dep_time_na = is.na(sched_dep_time),
dep_delay_na = is.na(dep_delay)) |>
count(dep_time_na, sched_dep_time_na, dep_delay_na)
## # A tibble: 2 × 4
## dep_time_na sched_dep_time_na dep_delay_na n
## <lgl> <lgl> <lgl> <int>
## 1 FALSE FALSE FALSE 328521
## 2 TRUE FALSE TRUE 8255
13.3 Boolean algebra
1.Find all flights where arr_delay is missing but dep_delay is not.
Find all flights where neither arr_time nor sched_arr_time are missing,
but arr_delay is.
#Find all flights where arr_delay is missing but dep_delay is not.
#Note that is.na() must be applied to each variable separately.
flights |>
filter(is.na(arr_delay) & !is.na(dep_delay))
#Find all flights where neither arr_time nor sched_arr_time are missing, but arr_delay is.
nycflights13::flights |>
filter(!is.na(arr_time) & !is.na(sched_arr_time) & is.na(arr_delay))
2.How many flights have a missing dep_time? What other variables are
missing in these rows? What might these rows represent?
#How many flights have a missing dep_time
flights |>
count(is.na(dep_time))
## # A tibble: 2 × 2
## `is.na(dep_time)` n
## <lgl> <int>
## 1 FALSE 328521
## 2 TRUE 8255
#What other variables are missing in these rows?
summary(flights)
## year month day dep_time sched_dep_time
## Min. :2013 Min. : 1.000 Min. : 1.00 Min. : 1 Min. : 106
## 1st Qu.:2013 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.: 907 1st Qu.: 906
## Median :2013 Median : 7.000 Median :16.00 Median :1401 Median :1359
## Mean :2013 Mean : 6.549 Mean :15.71 Mean :1349 Mean :1344
## 3rd Qu.:2013 3rd Qu.:10.000 3rd Qu.:23.00 3rd Qu.:1744 3rd Qu.:1729
## Max. :2013 Max. :12.000 Max. :31.00 Max. :2400 Max. :2359
## NA's :8255
## dep_delay arr_time sched_arr_time arr_delay
## Min. : -43.00 Min. : 1 Min. : 1 Min. : -86.000
## 1st Qu.: -5.00 1st Qu.:1104 1st Qu.:1124 1st Qu.: -17.000
## Median : -2.00 Median :1535 Median :1556 Median : -5.000
## Mean : 12.64 Mean :1502 Mean :1536 Mean : 6.895
## 3rd Qu.: 11.00 3rd Qu.:1940 3rd Qu.:1945 3rd Qu.: 14.000
## Max. :1301.00 Max. :2400 Max. :2359 Max. :1272.000
## NA's :8255 NA's :8713 NA's :9430
## carrier flight tailnum origin
## Length:336776 Min. : 1 Length:336776 Length:336776
## Class :character 1st Qu.: 553 Class :character Class :character
## Mode :character Median :1496 Mode :character Mode :character
## Mean :1972
## 3rd Qu.:3465
## Max. :8500
##
## dest air_time distance hour
## Length:336776 Min. : 20.0 Min. : 17 Min. : 1.00
## Class :character 1st Qu.: 82.0 1st Qu.: 502 1st Qu.: 9.00
## Mode :character Median :129.0 Median : 872 Median :13.00
## Mean :150.7 Mean :1040 Mean :13.18
## 3rd Qu.:192.0 3rd Qu.:1389 3rd Qu.:17.00
## Max. :695.0 Max. :4983 Max. :23.00
## NA's :9430
## minute time_hour
## Min. : 0.00 Min. :2013-01-01 05:00:00.00
## 1st Qu.: 8.00 1st Qu.:2013-04-04 13:00:00.00
## Median :29.00 Median :2013-07-03 10:00:00.00
## Mean :26.23 Mean :2013-07-03 05:22:54.64
## 3rd Qu.:44.00 3rd Qu.:2013-10-01 07:00:00.00
## Max. :59.00 Max. :2013-12-31 23:00:00.00
##
-These rows likely represent cancelled flights.
3.Assuming that a missing dep_time implies that a flight is
cancelled, look at the number of cancelled flights per day. Is there a
pattern? Is there a connection between the proportion of cancelled
flights and the average delay of non-cancelled flights?
#Treat a flight with a missing arr_delay or dep_delay as cancelled.
cancelled_per_day <-
flights %>%
mutate(cancelled = (is.na(arr_delay) | is.na(dep_delay))) %>%
group_by(year, month, day) %>%
summarise(
cancelled_num = sum(cancelled),
flights_num = n(),
)
# It is likely that days with more flights would have a higher probability of cancellations
ggplot(cancelled_per_day) +
geom_point(aes(x = flights_num, y = cancelled_num))

#Is there a connection between the proportion of cancelled flights and the average delay of non-cancelled flights?
flights %>% group_by(month, day) %>%
summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE),
prop_cancelled = sum(is.na(dep_time)) / n()) %>%
ggplot(mapping = aes(x = avg_dep_delay, y = prop_cancelled)) +
geom_point() +
geom_smooth(method = 'lm', se = FALSE)

13.4 Summaries
13.4.1 Logical summaries
-There are two main logical summaries: any() and all(). any(x) is the
equivalent of |; it’ll return TRUE if there are any TRUE’s in x. all(x)
is the equivalent of &; it’ll return TRUE only if all values of x are
TRUE’s.
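A quick base-R illustration, including how NAs propagate through any() and all():

```r
x <- c(TRUE, FALSE, NA)
any(x)                # TRUE: at least one TRUE, the NA can't change that
all(x)                # FALSE: the FALSE settles it regardless of the NA
any(c(FALSE, NA))     # NA: the NA could be the TRUE we're looking for
all(c(TRUE, NA))      # NA: the NA could be FALSE
any(x, na.rm = TRUE)  # drop NAs first, just like sum() and mean()
```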
13.4.4 Exercises
1.What will sum(is.na(x)) tell you? How about mean(is.na(x))?
-sum(is.na(x)) returns the number of NAs in x, because each TRUE counts
as 1. mean(is.na(x)) gives the proportion of values in x that are NA.
2.What does prod() return when applied to a logical vector? What
logical summary function is it equivalent to? What does min() return
when applied to a logical vector? What logical summary function is it
equivalent to? Read the documentation and perform a few experiments.
-When applied to a logical vector, prod() returns the product of all
the elements, treating TRUE as 1 and FALSE as 0, so it is 1 only when
every value is TRUE. It is equivalent to all(), the summary form of &.
-When min() is applied to a logical vector, it returns FALSE (0) if
there are any FALSE values, and TRUE (1) if all values are TRUE. It is
also equivalent to the all() function.
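A few experiments (base R) confirming those equivalences:

```r
x <- c(TRUE, TRUE, FALSE)
y <- c(TRUE, TRUE, TRUE)
prod(x)  # 0: a single FALSE zeroes the product, mirroring all(x)
prod(y)  # 1: every value TRUE, mirroring all(y)
min(x)   # 0: the smallest value is FALSE, also mirroring all(x)
max(x)   # 1: the largest value is TRUE, mirroring any(x)
```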
14 Numbers
14.1 Introduction
-We’ll start by giving you a couple of tools to make numbers if you
have strings, and then going into a little more detail of count(). Then
we’ll dive into various numeric transformations that pair well with
mutate(), including more general transformations that can be applied to
other types of vectors, but are often used with numeric vectors. We’ll
finish off by covering the summary functions that pair well with
summarize() and show you how they can also be used with mutate().
14.1.1 Prerequisites
library(tidyverse)
library(nycflights13)
14.2 Making numbers
x <- c("1.2", "5.6", "1e3")
parse_double(x)
## [1] 1.2 5.6 1000.0
# Use parse_number() when the string contains non-numeric text that you want to ignore.
x <- c("$1,234", "USD 3,513", "59%")
parse_number(x)
## [1] 1234 3513 59
14.3 Counts
#How count() works
flights |> count(dest)
## # A tibble: 105 × 2
## dest n
## <chr> <int>
## 1 ABQ 254
## 2 ACK 265
## 3 ALB 439
## 4 ANC 8
## 5 ATL 17215
## 6 AUS 2439
## 7 AVL 275
## 8 BDL 443
## 9 BGR 375
## 10 BHM 297
## # ℹ 95 more rows
#If you want to see the most common values, add sort = TRUE
flights |> count(dest, sort = TRUE)
## # A tibble: 105 × 2
## dest n
## <chr> <int>
## 1 ORD 17283
## 2 ATL 17215
## 3 LAX 16174
## 4 BOS 15508
## 5 MCO 14082
## 6 CLT 14064
## 7 SFO 13331
## 8 FLL 12055
## 9 MIA 11728
## 10 DCA 9705
## # ℹ 95 more rows
#if you want to see all the values, you can use |> View() or |> print(n = Inf).
#You can perform the same computation “by hand” with group_by(), summarize() and n().
flights |>
group_by(dest) |>
summarize(
n = n(),
delay = mean(arr_delay, na.rm = TRUE))
## # A tibble: 105 × 3
## dest n delay
## <chr> <int> <dbl>
## 1 ABQ 254 4.38
## 2 ACK 265 4.85
## 3 ALB 439 14.4
## 4 ANC 8 -2.5
## 5 ATL 17215 11.3
## 6 AUS 2439 6.02
## 7 AVL 275 8.00
## 8 BDL 443 7.05
## 9 BGR 375 8.03
## 10 BHM 297 16.9
## # ℹ 95 more rows
#n_distinct(x) counts the number of distinct (unique) values of one or more variables.
flights |>
group_by(dest) |>
summarize(carriers = n_distinct(carrier)) |>
arrange(desc(carriers))
## # A tibble: 105 × 2
## dest carriers
## <chr> <int>
## 1 ATL 7
## 2 BOS 7
## 3 CLT 7
## 4 ORD 7
## 5 TPA 7
## 6 AUS 6
## 7 DCA 6
## 8 DTW 6
## 9 IAD 6
## 10 MSP 6
## # ℹ 95 more rows
#A weighted count is a sum. For example you could “count” the number of miles each plane flew:
flights |>
group_by(tailnum) |>
summarize(miles = sum(distance))
## # A tibble: 4,044 × 2
## tailnum miles
## <chr> <dbl>
## 1 D942DN 3418
## 2 N0EGMQ 250866
## 3 N10156 115966
## 4 N102UW 25722
## 5 N103US 24619
## 6 N104UW 25157
## 7 N10575 150194
## 8 N105UW 23618
## 9 N107US 21677
## 10 N108UW 32070
## # ℹ 4,034 more rows
#Weighted counts are a common problem so count() has a wt argument that does the same thing:
flights |> count(tailnum, wt = distance)
## # A tibble: 4,044 × 2
## tailnum n
## <chr> <dbl>
## 1 D942DN 3418
## 2 N0EGMQ 250866
## 3 N10156 115966
## 4 N102UW 25722
## 5 N103US 24619
## 6 N104UW 25157
## 7 N10575 150194
## 8 N105UW 23618
## 9 N107US 21677
## 10 N108UW 32070
## # ℹ 4,034 more rows
#You can count missing values by combining sum() and is.na().
flights |>
group_by(dest) |>
summarize(n_cancelled = sum(is.na(dep_time)))
## # A tibble: 105 × 2
## dest n_cancelled
## <chr> <int>
## 1 ABQ 0
## 2 ACK 0
## 3 ALB 20
## 4 ANC 0
## 5 ATL 317
## 6 AUS 21
## 7 AVL 12
## 8 BDL 31
## 9 BGR 15
## 10 BHM 25
## # ℹ 95 more rows
14.3.1 Exercises
1.How can you use count() to count the number rows with a missing
value for a given variable?
flights |>
group_by(dest) |>
count(is.na(dep_time))
## # A tibble: 203 × 3
## # Groups: dest [105]
## dest `is.na(dep_time)` n
## <chr> <lgl> <int>
## 1 ABQ FALSE 254
## 2 ACK FALSE 265
## 3 ALB FALSE 419
## 4 ALB TRUE 20
## 5 ANC FALSE 8
## 6 ATL FALSE 16898
## 7 ATL TRUE 317
## 8 AUS FALSE 2418
## 9 AUS TRUE 21
## 10 AVL FALSE 263
## # ℹ 193 more rows
2.Expand the following calls to count() to instead use group_by(),
summarize(), and arrange(): flights |> count(dest, sort = TRUE)
flights |> count(tailnum, wt = distance)
flights |>
group_by(dest) |>
summarize(n = n()) |>
arrange(desc(n))
## # A tibble: 105 × 2
## dest n
## <chr> <int>
## 1 ORD 17283
## 2 ATL 17215
## 3 LAX 16174
## 4 BOS 15508
## 5 MCO 14082
## 6 CLT 14064
## 7 SFO 13331
## 8 FLL 12055
## 9 MIA 11728
## 10 DCA 9705
## # ℹ 95 more rows
flights |>
group_by(tailnum) |>
summarize(total_distance = sum(distance, na.rm = TRUE)) |>
arrange(desc(total_distance))
## # A tibble: 4,044 × 2
## tailnum total_distance
## <chr> <dbl>
## 1 <NA> 1784167
## 2 N328AA 939101
## 3 N338AA 931183
## 4 N327AA 915665
## 5 N335AA 909696
## 6 N323AA 844529
## 7 N319AA 840510
## 8 N336AA 838086
## 9 N329AA 830776
## 10 N324AA 794895
## # ℹ 4,034 more rows
14.5 General transformations
14.5.1 Ranks
#Note that the smallest values get the lowest ranks; use desc(x) to give the largest values the smallest ranks:
x <- c(1, 2, 2, 3, 4, NA)
min_rank(desc(x))
## [1] 5 3 3 2 1 NA
-View the documentation for dplyr::row_number(), dplyr::dense_rank(),
dplyr::percent_rank(), and dplyr::cume_dist()
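As a quick check on how they differ, base R's rank() can reproduce two of them (a sketch; the corresponding dplyr calls are noted in comments):

```r
x <- c(1, 2, 2, 3)
rank(x, ties.method = "min")    # 1 2 2 4 -- same as dplyr::min_rank(x)
rank(x, ties.method = "first")  # 1 2 3 4 -- same as dplyr::row_number(x)
# dense_rank() would give 1 2 2 3: ties share a rank but leave no gap
```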
14.5.2 Offsets
#dplyr::lead() and dplyr::lag() allow you to refer the values just before or just after the “current” value. They return a vector of the same length as the input, padded with NAs at the start or end:
x <- c(2, 5, 11, 11, 19, 35)
lag(x)
## [1] NA 2 5 11 11 19
lead(x)
## [1] 5 11 11 19 35 NA
#x - lag(x) gives you the difference between the current and previous value.
x - lag(x)
## [1] NA 3 6 0 8 16
#x == lag(x) tells you when the current value changes.
x == lag(x)
## [1] NA FALSE FALSE TRUE FALSE FALSE
14.5.3 Consecutive identifiers
- cumsum() of a logical vector increments by one at every TRUE, which makes it useful for creating group identifiers.
- Another approach for creating grouping variables is
consecutive_id(), which starts a new group every time one of its
arguments changes.
- To keep the first row from each repeated x, you could use
group_by(), consecutive_id(), and slice_head()
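The idea behind these tools can be sketched in base R: a counter that ticks up whenever the value changes plays the role of consecutive_id() (this is a sketch of the idea, not dplyr's implementation):

```r
x <- c("a", "a", "b", "b", "a", "a")
# TRUE at position 1 and wherever x differs from the previous value
changed <- c(TRUE, x[-1] != x[-length(x)])
id <- cumsum(changed)      # 1 1 2 2 3 3 -- like dplyr::consecutive_id(x)
# keeping the first row of each run is then just id's first occurrences
first_of_run <- !duplicated(id)
```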
Exercise
1.Find the 10 most delayed flights using a ranking function. How do
you want to handle ties? Carefully read the documentation for
min_rank().
flights |>
arrange(desc(dep_delay)) |>
mutate(rank = min_rank(desc(dep_delay))) |>
filter(rank <= 10)
## # A tibble: 10 × 20
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <dbl> <int> <dbl> <dbl> <int>
## 1 2013 1 9 6.67 900 1301 12.7 1530
## 2 2013 6 15 14.5 1935 1137 16.1 2120
## 3 2013 1 10 11.3 1635 1126 12.7 1810
## 4 2013 9 20 11.7 1845 1014 14.9 2210
## 5 2013 7 22 8.75 1600 1005 10.8 1815
## 6 2013 4 10 11 1900 960 13.7 2211
## 7 2013 3 17 23.3 810 911 1.58 1020
## 8 2013 6 27 10 1900 899 12.6 2226
## 9 2013 7 22 22.9 759 898 1.33 1026
## 10 2013 12 5 7.92 1700 896 11 2020
## # ℹ 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>, rank <int>
2.Which plane (tailnum) has the worst on-time record?
#The worst on-time record is the highest average delay, so sort in descending order.
flights |>
group_by(tailnum) |>
summarize(average_delay = mean(dep_delay, na.rm = TRUE)) |>
arrange(desc(average_delay))
3.What time of day should you fly if you want to avoid delays as much
as possible?
flights |>
mutate(dep_hour = sched_dep_time %/% 100) |>
group_by(dep_hour) |>
summarize(average_delay = mean(dep_delay, na.rm = TRUE)) |>
arrange(average_delay)
-Average delays are lowest for flights scheduled early in the morning
and grow through the day, so fly as early as possible.
4.What does flights |> group_by(dest) |> filter(row_number()
< 4) do? What does flights |> group_by(dest) |>
filter(row_number(dep_delay) < 4) do?
-The first pipeline keeps the first three rows of each destination
group, in whatever order the rows currently appear.
-The second ranks rows within each destination by dep_delay, so it
keeps the three least-delayed flights to each destination; rows with a
missing dep_delay are dropped because row_number() returns NA for them.
5.For each destination, compute the total minutes of delay. For each
flight, compute the proportion of the total delay for its
destination.
#Calculate the total minutes of delay for each destination
destination_dt <- flights |>
group_by(dest) |>
summarize(total_delay = sum(dep_delay, na.rm = TRUE))
#Join the original dataset
new_flights <- flights |>
left_join(destination_dt, by = "dest")
#Calculate the proportion of the total delay for each flight's destination
new_flights |>
mutate(proportion_of_total_delay = dep_delay / total_delay)
## # A tibble: 336,776 × 21
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <dbl> <int> <dbl> <dbl> <int>
## 1 2013 1 1 5.25 515 2 8.5 819
## 2 2013 1 1 5.58 529 4 8.83 830
## 3 2013 1 1 5.67 540 2 9.42 850
## 4 2013 1 1 5.75 545 -1 10.1 1022
## 5 2013 1 1 5.92 600 -6 8.17 837
## 6 2013 1 1 5.92 558 -4 7.67 728
## 7 2013 1 1 5.92 600 -5 9.25 854
## 8 2013 1 1 5.92 600 -3 7.17 723
## 9 2013 1 1 5.92 600 -3 8.67 846
## 10 2013 1 1 6 600 -2 7.92 745
## # ℹ 336,766 more rows
## # ℹ 13 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>, total_delay <dbl>,
## # proportion_of_total_delay <dbl>
6.Delays are typically temporally correlated: even once the problem
that caused the initial delay has been resolved, later flights are
delayed to allow earlier flights to leave. Using lag(), explore how the
average flight delay for an hour is related to the average delay for the
previous hour.
#The original set
flights |>
mutate(hour = dep_time %/% 100) |>
group_by(year, month, day, hour) |>
summarize(
dep_delay = mean(dep_delay, na.rm = TRUE),
n = n(),
.groups = "drop"
) |>
filter(n > 5)
## # A tibble: 578 × 6
## year month day hour dep_delay n
## <int> <int> <int> <dbl> <dbl> <int>
## 1 2013 1 1 0 11.5 838
## 2 2013 1 2 0 13.9 935
## 3 2013 1 2 NA NaN 8
## 4 2013 1 3 0 11.0 904
## 5 2013 1 3 NA NaN 10
## 6 2013 1 4 0 8.95 909
## 7 2013 1 4 NA NaN 6
## 8 2013 1 5 0 5.73 717
## 9 2013 1 6 0 7.15 831
## 10 2013 1 7 0 5.42 930
## # ℹ 568 more rows
#Modified
flights |>
mutate(hour = dep_time %/% 100) |>
group_by(year, month, day, hour) |>
summarize(
dep_delay = mean(dep_delay, na.rm = TRUE),
n = n(),
.groups = "drop"
) |>
filter(n > 5) |>
mutate(prev_hour_delay = lag(dep_delay)) |>
na.omit()
## # A tibble: 152 × 7
## year month day hour dep_delay n prev_hour_delay
## <int> <int> <int> <dbl> <dbl> <int> <dbl>
## 1 2013 1 2 0 13.9 935 11.5
## 2 2013 1 6 0 7.15 831 5.73
## 3 2013 1 7 0 5.42 930 7.15
## 4 2013 1 8 0 2.55 895 5.42
## 5 2013 1 9 0 2.28 897 2.55
## 6 2013 1 10 0 2.84 929 2.28
## 7 2013 1 11 0 2.82 919 2.84
## 8 2013 1 15 0 0.124 881 2.79
## 9 2013 1 20 0 6.78 782 3.48
## 10 2013 1 21 0 7.83 904 6.78
## # ℹ 142 more rows
-The plot suggests that a higher average delay in the previous hour is
associated with a higher average delay in the current hour.
7.Look at each destination. Can you find flights that are
suspiciously fast (i.e. flights that represent a potential data entry
error)? Compute the air time of a flight relative to the shortest flight
to that destination. Which flights were most delayed in the air?
#Find the shortest flight to each destination
shortest_flight <- flights |>
group_by(dest) |>
mutate(shortest_time = min(air_time, na.rm = TRUE),
mean_time = mean(air_time, na.rm = TRUE)) |>
ungroup() |>
mutate(diff_from_short = air_time - shortest_time,
diff_from_mean = air_time - mean_time) |>
arrange(diff_from_mean) |>
select(dest, shortest_time, air_time, diff_from_mean, diff_from_short, tailnum)
#Compute the air time of a flight relative to the shortest flight to that destination.
shortest_flight |>
mutate(relative_air_time = air_time / shortest_time) |>
arrange(desc(relative_air_time)) |>
head(10)
-Planes N729JB, N531JB, and N566JB appear to be among the most delayed
in the air by this measure.
8.Find all destinations that are flown by at least two carriers. Use
those destinations to come up with a relative ranking of the carriers
based on their performance for the same destination.
#Find all destinations that are flown by at least two carriers.
destinations_with_twomore_carriers <- flights |>
group_by(dest) |>
mutate(carrier_count = n_distinct(carrier)) |>
filter(carrier_count >= 2) |>
distinct(dest)
#Let's see the relative ranking of the carriers based on their performance for the same destination.
flights |>
filter(dest %in% destinations_with_twomore_carriers$dest) |>
group_by(carrier, dest) |>
summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) |>
group_by(carrier) |>
summarize(relative_rank = mean(avg_dep_delay, na.rm = TRUE)) |>
arrange(relative_rank)
## `summarise()` has grouped output by 'carrier'. You can override using the
## `.groups` argument.
## # A tibble: 16 × 2
## carrier relative_rank
## <chr> <dbl>
## 1 US 3.85
## 2 HA 4.90
## 3 AS 5.80
## 4 VX 7.01
## 5 DL 7.17
## 6 FL 8.37
## 7 AA 9.90
## 8 MQ 11.2
## 9 YV 12.0
## 10 UA 12.5
## 11 9E 12.5
## 12 B6 13.0
## 13 WN 15.6
## 14 EV 19.4
## 15 F9 20.2
## 16 OO 27.6
-By this ranking, US has the best average performance and OO the worst
across shared destinations.
14.6 Numeric summaries
14.6.1 Center
-mean(), median()
14.6.2 Minimum, maximum, and quantiles
-min() and max() will give you the largest and smallest values.
-quantile() is a generalization of the median: quantile(x, 0.25) will
find the value of x that is greater than 25% of the values, quantile(x,
0.5) is equivalent to the median, and quantile(x, 0.95) will find the
value that’s greater than 95% of the values.
14.6.3 Spread
-IQR() might be new — it’s quantile(x, 0.75) - quantile(x, 0.25) and
gives you the range that contains the middle 50% of the data.
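A small worked example with the integers 1 to 9:

```r
x <- 1:9
quantile(x, 0.25)  # 3: a quarter of the values fall at or below this
quantile(x, 0.50)  # 5: identical to median(x)
IQR(x)             # 4: quantile(x, 0.75) - quantile(x, 0.25) = 7 - 3
```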
14.6.4 Distributions
-geom_freqpoly() can help visualize a distribution
14.6.5 Positions
-Extracting a value at a specific position: first(x), last(x), and nth(x, n).
-Because dplyr functions use _ to separate components of function and
argument names, these functions use na_rm instead of na.rm.
14.6.6 With mutate()
-x / sum(x) calculates the proportion of a total.
-(x - mean(x)) / sd(x) computes a Z-score (standardized to mean 0 and sd 1).
-(x - min(x)) / (max(x) - min(x)) standardizes to range [0, 1].
-x / first(x) computes an index based on the first observation.
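These are plain vector operations, so they are easy to verify on a toy vector (base R, with x[1] standing in for first(x)):

```r
x <- c(2, 4, 6, 8)
prop    <- x / sum(x)                        # proportions, summing to 1
z       <- (x - mean(x)) / sd(x)             # Z-scores: mean 0, sd 1
ranged  <- (x - min(x)) / (max(x) - min(x))  # rescaled to [0, 1]
indexed <- x / x[1]                          # index relative to first value
```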
14.6.7 Exercises (WARN)
- Brainstorm at least 5 different ways to assess the typical delay
characteristics of a group of flights. When is mean() useful? When is
median() useful? When might you want to use something else? Should you
use arrival delay or departure delay? Why might you want to use data
from planes?
-When is mean() useful? When you want an overall sense of the typical
delay in a group of flights, since it captures the central tendency of
the data.
-When is median() useful? When you want the central value that
separates the higher half of delays from the lower half; it is more
robust when a few extreme delays skew the mean.
-When might you want to use something else? When you care about
specific parts of the distribution, e.g. quantiles (25th or 75th
percentile) to assess performance on the best or worst flights.
- Which destinations show the greatest variation in air speed?
flights |>
group_by(dest) |>
summarize(variation = sd(distance/air_time, na.rm = TRUE)) |>
arrange(desc(variation)) |>
head(5)
## # A tibble: 5 × 2
## dest variation
## <chr> <dbl>
## 1 OKC 0.639
## 2 TUL 0.624
## 3 ILM 0.615
## 4 BNA 0.615
## 5 CLT 0.611
-OKC shows the greatest variation in air speed
3.Create a plot to further explore the adventures of EGE. Can you
find any evidence that the airport moved locations? Can you find another
variable that might explain the difference?
EGE_flights <- flights |>
filter(dest == "EGE")
EGE_flights |>
ggplot(aes(x = time_hour, y = distance, color = origin)) +
geom_point()
-Grouping by year is uninformative because flights only covers 2013.
Plotting distance over time instead shows the recorded distance to EGE
changing during the year, consistent with the airport location (or its
recorded coordinates) having moved.
15 Strings
15.1.1 Prerequisites
library(tidyverse)
library(babynames)
15.2 Creating a string
-You can create a string using either single quotes (') or double
quotes ("). There's no difference in behavior between the two, so in the
interests of consistency, the tidyverse style guide recommends using ",
unless the string contains multiple ".
15.2.1 Escapes
#To include a literal single or double quote in a string, you can use \ to “escape” it:
double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"
double_quote
## [1] "\""
single_quote
## [1] "'"
#So if you want to include a literal backslash in your string, you’ll need to escape it: "\\":
backslash <- "\\"
backslash
## [1] "\\"
#To see the raw contents of the string, use str_view()
x <- c(single_quote, double_quote, backslash)
x
## [1] "'" "\"" "\\"
str_view(x)
## [1] │ '
## [2] │ "
## [3] │ \
15.2.2 Raw strings
-A raw string usually starts with r"( and finishes with )". But if
your string contains )" you can instead use r"[]" or r"{}", and if
that's still not enough, you can insert any number of dashes to make the
opening and closing pairs unique, e.g., r"--()--", r"---()---", etc.
15.2.3 Other special characters
-The most common are \n, a new line, and \t, a tab. You'll also
sometimes see strings containing Unicode escapes that start with \u or \U.
15.2.4 Exercises
- Create strings that contain the following values:
He said "That's amazing!"
\a\b\c\d
\\\\\\
t1524 <- r"('He said "That's amazing!"'
"\a\b\c\d"
"\\\\\\")"
t1524
## [1] "'He said \"That's amazing!\"'\n\"\\a\\b\\c\\d\"\n\"\\\\\\\\\\\\\""
str_view(t1524)
## [1] │ 'He said "That's amazing!"'
## │ "\a\b\c\d"
## │ "\\\\\\"
- Create the string "This\u00a0is\u00a0tricky" in your R session and
print it. What happens to the special character "\u00a0"? How does
str_view() display it? Can you do a little googling to figure out what
this special character is?
x <- "This\u00a0is\u00a0tricky"
x
## [1] "This is tricky"
str_view(x)
## [1] │ This{\u00a0}is{\u00a0}tricky
#lets try
x <- c("This", "\u00a0", "is", "\u00a0", "tricky")
x
## [1] "This" " " "is" " " "tricky"
-"\u00a0" prints as an ordinary space, but str_view() displays it as {\u00a0}; it is the Unicode NO-BREAK SPACE!
15.3 Creating many strings from data
15.3.1 str_c()
-str_c() takes any number of vectors as arguments and returns a
character vector
#If you want missing values to display in another way, use coalesce() to replace them. Depending on what you want, you might use it either inside or outside of str_c():
df <- tibble(name = c("Flora", "David", "Terra", NA))
df |>
mutate(
greeting1 = str_c("Hi ", coalesce(name, "you"), "!"),
greeting2 = coalesce(str_c("Hi ", name, "!"), "Hi!")
)
## # A tibble: 4 × 3
## name greeting1 greeting2
## <chr> <chr> <chr>
## 1 Flora Hi Flora! Hi Flora!
## 2 David Hi David! Hi David!
## 3 Terra Hi Terra! Hi Terra!
## 4 <NA> Hi you! Hi!
15.3.2 str_glue()
- str_glue() converts missing values to the string “NA”.
15.3.3 str_flatten()
- str_flatten() takes a character vector and combines each element of
the vector into a single string, and works well with summarize()
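Base R's paste() with collapse does the same flattening, which makes the behavior easy to check without stringr:

```r
x <- c("x", "y", "z")
paste(x, collapse = ", ")  # "x, y, z" -- like str_flatten(x, ", ")
# str_flatten() additionally has a `last` argument for "x, y, and z";
# in base R you would have to build that by hand.
```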
15.3.4 Exercises
1.Compare and contrast the results of paste0() with str_c() for the
following inputs:
str_c("hi ", NA)
## [1] NA
paste0("hi ", NA)
## [1] "hi NA"
paste0(letters[1:2], letters[1:3])
## [1] "aa" "bb" "ac"
In the first case, paste0() treats NA as the string "NA" and returns
"hi NA", while str_c() returns NA. In the second case, paste0() recycles
the shorter vector, while str_c() errors because lengths 2 and 3 are not
compatible.
- What’s the difference between paste() and paste0()? How can you
recreate the equivalent of paste() with str_c()?
-?paste(), ?paste0() -paste() inserts a separator between values (a
space by default), while paste0() uses none. You can recreate paste()
with str_c(..., sep = " ").
- Convert the following expressions from str_c() to str_glue() or vice
versa:
str_c("The price of ", food, " is ", price)
str_glue("I'm {age} years old and live in {country}")
str_c("\\section{", title, "}")
-food <- c("food") -price <- c("price") -age <- c("age")
-country <- c("country")
-str_glue("The price of {food} is {price}")
-str_c("I'm ", age, " years old and live in ", country)
-str_glue("\\section{{{title}}}")
15.4.2 Separating into columns
-separate_wider_delim() can separate a string into columns, but it
needs the delimiter and the names in the arguments.
-In the argument, you can use an NA name to omit it from results.
-separate_wider_position() works a little differently because you
typically want to specify the width of each column. So you give it a
named integer vector, where the name gives the name of the new column,
and the value is the number of characters it occupies. You can omit
values from the output by not naming them
15.4.3 Diagnosing widening problems
-separate_wider_delim() provides two arguments to help if some of the
rows don’t have the expected number of pieces: too_few and too_many.
- too_few = "debug" adds extra columns that help you locate the
problem rows; once they're fixed, switch back to the default so new
problems become errors. too_few = "align_start" and too_few =
"align_end" fill in the missing pieces with NAs and move on.
15.5 Letters
15.5.1 Length
-str_length() tells you the number of letters in the string
15.5.2 Subsetting
-You can extract parts of a string using str_sub(string, start, end),
where start and end are the positions where the substring should start
and end.
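Base R's substr() covers the positive-index case; the negative indices (counting from the end) are a stringr convenience:

```r
s <- "Hadley"
substr(s, 2, 4)                    # "adl" -- like str_sub(s, 2, 4)
# str_sub(s, -3, -1) would give "ley"; with substr() you must
# compute the end-relative positions yourself:
substr(s, nchar(s) - 2, nchar(s))  # "ley"
```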
Exercises (WARN)
- We could use str_sub() with mutate() to find the first and last
letter of each name (don't forget to specify the start and end positions).
babynames <- babynames::babynames
babynames |>
mutate(
first_letter = str_sub(name, 1, 1),
last_letter = str_sub(name, -1, -1)
)
## # A tibble: 1,924,665 × 7
## year sex name n prop first_letter last_letter
## <dbl> <chr> <chr> <int> <dbl> <chr> <chr>
## 1 1880 F Mary 7065 0.0724 M y
## 2 1880 F Anna 2604 0.0267 A a
## 3 1880 F Emma 2003 0.0205 E a
## 4 1880 F Elizabeth 1939 0.0199 E h
## 5 1880 F Minnie 1746 0.0179 M e
## 6 1880 F Margaret 1578 0.0162 M t
## 7 1880 F Ida 1472 0.0151 I a
## 8 1880 F Alice 1414 0.0145 A e
## 9 1880 F Bertha 1320 0.0135 B a
## 10 1880 F Sarah 1288 0.0132 S h
## # ℹ 1,924,655 more rows
- When computing the distribution of the length of babynames, why did
we use wt = n? Use str_length() and str_sub() to extract the middle
letter from each baby name. What will you do if the string has an even
number of characters?
-We use wt = n because each row of babynames records n babies with a
given name, so weighting by n counts babies rather than distinct names.
babynames |>
mutate(
middle_letter = ifelse(str_length(name) %% 2 == 1,
str_sub(name, str_length(name) %/% 2 + 1,
str_length(name) %/% 2 + 1),
str_sub(name, str_length(name) %/% 2,
str_length(name) %/% 2 + 1) ))
## # A tibble: 1,924,665 × 6
## year sex name n prop middle_letter
## <dbl> <chr> <chr> <int> <dbl> <chr>
## 1 1880 F Mary 7065 0.0724 ar
## 2 1880 F Anna 2604 0.0267 nn
## 3 1880 F Emma 2003 0.0205 mm
## 4 1880 F Elizabeth 1939 0.0199 a
## 5 1880 F Minnie 1746 0.0179 nn
## 6 1880 F Margaret 1578 0.0162 ga
## 7 1880 F Ida 1472 0.0151 d
## 8 1880 F Alice 1414 0.0145 i
## 9 1880 F Bertha 1320 0.0135 rt
## 10 1880 F Sarah 1288 0.0132 r
## # ℹ 1,924,655 more rows
- Are there any major trends in the length of babynames over time?
What about the popularity of first and last letters?
#Part 1
library(babynames)
babynames1 <- babynames |>
group_by(year) |>
mutate(average_name_length = mean(nchar(name)))
ggplot(data = babynames1, aes(x = year, y = average_name_length)) +
geom_line() +
labs(x = "Year", y = "Average Name Length") +
ggtitle("Trends in the Length of Baby Names Over Time")

#Part 2
babynames2 <- babynames |>
mutate(first_letter = str_sub(name, 1, 1),
last_letter = str_sub(name, -1, -1)) |>
group_by(first_letter) |>
mutate(first_letter_count = n()) |>
group_by(last_letter) |>
mutate(last_letter_count = n())
ggplot(data = babynames2, aes(x = first_letter, y = first_letter_count)) +
geom_bar(stat = "identity") +
labs(x = "First Letter", y = "Count") +
ggtitle("Popularity of First Letters")

ggplot(data = babynames2, aes(x = last_letter, y = last_letter_count)) +
geom_bar(stat = "identity") +
labs(x = "Last Letter", y = "Count") +
ggtitle("Popularity of Last Letters")

15.6 Non-English text
15.6.1 Encoding
-Specify a file's character encoding with locale(encoding = ...), e.g. read_csv(file, locale = locale(encoding = "Latin1")).
15.6.2 Letter variations
-Working in languages with accents poses a significant challenge when
determining the position of letters (e.g., with str_length() and
str_sub())
-Note that a comparison of these strings with == interprets these
strings as different, while the handy str_equal() function in stringr
recognizes that both have the same appearance
-locale = can help adapt different languages’ unique formats.
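A tiny example makes the accent problem concrete (a sketch; stringr 1.5 or later behaves this way):

```r
library(stringr)

u1 <- "\u00fc"    # "ü" as a single precomposed code point
u2 <- "u\u0308"   # "u" followed by a combining umlaut
u1 == u2          # FALSE: different underlying code points
str_equal(u1, u2) # TRUE: same rendered character
str_length(u1)    # 1
str_length(u2)    # 2
str_to_upper("i", locale = "tr")  # "İ" under Turkish rules
```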
16 Regular expressions
16.1 Introduction
library(tidyverse)
library(babynames)
16.3 Key functions
16.3.1 Detect matches
-str_detect() returns a logical vector that is TRUE if the pattern
matches an element of the character vector and FALSE otherwise
16.3.2 Count matches
-str_count() tells you how many matches there are in each string.
-str_to_lower() converts strings to lower case.
16.3.3 Replace values
-str_replace() replaces the first match, and as the name suggests,
str_replace_all() replaces all matches
-str_remove() and str_remove_all() are handy shortcuts for
str_replace(x, pattern, "")
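For instance, a quick sketch using a vowel pattern:

```r
library(stringr)

x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")      # "-pple" "p-ar" "b-nana"  (first match only)
str_replace_all(x, "[aeiou]", "-")  # "-ppl-" "p--r" "b-n-n-"  (every match)
str_remove_all(x, "[aeiou]")        # same as str_replace_all(x, "[aeiou]", "")
```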
16.3 Exercises (warn)
1.What baby name has the most vowels? What name has the highest
proportion of vowels? (Hint: what is the denominator?)
#Baby name with the most vowels
babynames |>
mutate(vowel_count = str_count(name, "[aeiouAEIOU]")) |>
filter(vowel_count == max(vowel_count)) |>
distinct(name)
## # A tibble: 2 × 1
## name
## <chr>
## 1 Mariaguadalupe
## 2 Mariadelrosario
#Baby name with the highest proportion of vowels
babynames |>
mutate(vowel_count = str_count(name, "[aeiouAEIOU]"))|>
mutate(vowel_proportion = vowel_count / nchar(name)) |>
filter(vowel_proportion == max(vowel_proportion)) |>
select(name, vowel_proportion)
## # A tibble: 110 × 2
## name vowel_proportion
## <chr> <dbl>
## 1 Eua 1
## 2 Eua 1
## 3 Eua 1
## 4 Eua 1
## 5 Ea 1
## 6 Ai 1
## 7 Ai 1
## 8 Ai 1
## 9 Ia 1
## 10 Ai 1
## # ℹ 100 more rows
2.Replace all forward slashes in “a/b/c/d/e” with backslashes. What
happens if you attempt to undo the transformation by replacing all
backslashes with forward slashes? (We’ll discuss the problem very
soon.)
original<- "F/-/1/5/E"
replaced<- gsub("/", "\\", original)
undo_string <- gsub("\\\\", "/", replaced)
replaced
## [1] "F-15E"
undo_string
## [1] "F-15E"
-Nothing changes when attempted to undo the transformation by
replacing all backslashes with forward slashes.
3.Implement a simple version of str_to_lower() using
str_replace_all().
replacements <- c(
"A" = "a", "B" = "b", "C" = "c", "D" = "d", "E" = "e",
"F" = "f", "G" = "g", "H" = "h", "I" = "i", "J" = "j",
"K" = "k", "L" = "l", "M" = "m", "N" = "n", "O" = "o",
"P" = "p", "Q" = "q", "R" = "r", "S" = "s", "T" = "t",
"U" = "u", "V" = "v", "W" = "w", "X" = "x", "Y" = "y",
"Z" = "z"
)
lower_words <- str_replace_all(words, pattern = replacements)
head(lower_words)
## [1] "a" "able" "about" "absolute" "accept" "account"
4.Create a regular expression that will match telephone numbers as
commonly written in your country.
x <- c("13562475567")
str_view(x, "\\d{3}-\\d{4}-\\d{4}")
16.4 Pattern details
16.4.1 Escaping
-We use strings to represent regular expressions, and \ is also used
as an escape symbol in strings. So to create the regular expression \.
we need the string "\\."
16.4.2 Anchors
-Anchor the regular expression using ^ to match the start or $ to
match the end
-To force a regular expression to match only the full string, anchor
it with both ^ and $
-Match the boundary between words (i.e. the start or end of a word)
with \b
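A short illustration of the three anchors:

```r
library(stringr)

str_view(c("summary", "sum"), "sum")    # matches inside both strings
str_view(c("summary", "sum"), "^sum$")  # matches only the whole string "sum"
str_view("sum of sums", "\\bsum\\b")    # word boundary: matches "sum", not "sums"
```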
16.4.3 Character classes
-There are many pairs for characters(cannot remember them all by
now).
16.4.4 Quantifiers
-{n} matches exactly n times.
-{n,} matches at least n times.
-{n,m} matches between n and m times.
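The three quantifier forms can be compared side by side:

```r
library(stringr)

x <- "-- -x- -xx- -xxx- -xxxx-"
str_view(x, "-x{2}-")   # exactly two x's:  matches "-xx-"
str_view(x, "-x{2,}-")  # two or more:      matches "-xx-", "-xxx-", "-xxxx-"
str_view(x, "-x{1,3}-") # one to three:     matches "-x-", "-xx-", "-xxx-"
```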
16.4.5 Operator precedence and parentheses
-Quantifiers have high precedence and alternation has low precedence,
which means that ab+ is equivalent to a(b+), and ^a|b$ is equivalent to
(^a)|(b$).
16.4.6 Grouping and capturing
-\1 refers to the match contained in the first parenthesis, \2 in the
second parenthesis, and so on.
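Backreferences in action, using stringr's built-in fruit vector:

```r
library(stringr)

# Words with a repeated pair of letters: group 1 captures two characters
# and \1 requires the same two characters to appear again immediately.
str_view(fruit, "(..)\\1")  # e.g. "banana" (anan), "coconut" (coco), "papaya" (papa)
```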
16.4.7 Exercises
1.How would you match the literal string "'\? How about "$^$"?
-To match "'\ you need the regex "'\\, which as an R string is written
"\"'\\\\".
-To match "$^$" you need the regex "\$\^\$", which as an R string is
written "\"\\$\\^\\$\"".
2.Explain why each of these patterns don't match a \: "\", "\\", "\\\".
-"\": in an R string, \ starts an escape sequence, so "\" is an
incomplete string.
-"\\": this string is the regex \, but in a regex a lone \ is itself an
escape that expects another character to follow, so the pattern is
incomplete.
-"\\\": the first \\ gives a literal backslash and the trailing \ again
starts a string escape, so the string is incomplete. To match a literal
backslash you need "\\\\" (four backslashes).
3.Given the corpus of common words in stringr::words, create regular
expressions that find all words that:
a.Start with "y". ("^y")
b.Don't start with "y". ("^[^y]")
c.End with "x". ("x$")
d.Are exactly three letters long. (Don't cheat by using
str_length()!) ("^.{3}$")
e.Have seven letters or more. (".{7,}")
f.Contain a vowel-consonant pair. ("[aeiou][^aeiou]")
g.Contain at least two vowel-consonant pairs in a row.
("[aeiou][^aeiou][aeiou][^aeiou]")
h.Only consist of repeated vowel-consonant pairs.
("^([aeiou][^aeiou])+$")
4.Create 11 regular expressions that match the British or American
spellings for each of the following words: airplane/aeroplane,
aluminum/aluminium, analog/analogue, ass/arse, center/centre,
defense/defence, donut/doughnut, gray/grey, modeling/modelling,
skeptic/sceptic, summarize/summarise. Try and make the shortest possible
regex!
a(ero)?plane
alumini?um
analog(ue)?
a(ss|rse)
cent(er|re)
defen(c|s)e
do(ugh)?nut
gr(a|e)y
modell?ing
s[kc]eptic
summari(s|z)e
5.Switch the first and last letters in words. Which of those strings
are still words?
switched <- str_replace(words, "^(.)(.*)(.)$", "\\3\\2\\1")
words[words %in% switched]
## [1] "a" "america" "area" "dad" "dead"
## [6] "deal" "dear" "depend" "dog" "educate"
## [11] "else" "encourage" "engine" "europe" "evidence"
## [16] "example" "excuse" "exercise" "expense" "experience"
## [21] "eye" "god" "health" "high" "knock"
## [26] "lead" "level" "local" "nation" "no"
## [31] "non" "on" "rather" "read" "refer"
## [36] "remember" "serious" "stairs" "test" "tonight"
## [41] "transport" "treat" "trust" "window" "yesterday"
6.Describe in words what these regular expressions match: (read
carefully to see if each entry is a regular expression or a string that
defines a regular expression.)
^.*$ matches any string, including the empty string: ^ and $ anchor
the pattern and .* matches any number of any characters.
"\\{.+\\}" is a string defining the regex \{.+\}: a pair of curly
braces with one or more characters between them.
\d{4}-\d{2}-\d{2} matches a date written as YYYY-MM-DD, where \d
represents a digit (0-9).
"\\\\{4}" is a string defining the regex \\{4}: exactly four literal
backslashes.
\..\..\.. matches a dot followed by any character, repeated three
times (e.g. ".a.b.c").
(.)\1\1 matches any character repeated three times in a row: (.)
captures a character and each \1 requires the same character again.
"(..)\\1" is a string defining the regex (..)\1: any two characters
followed immediately by the same two characters (e.g. "abab").
7.Solve the beginner regexp crosswords at https://regexcrossword.com/challenges/beginner.
16.5 Pattern control
16.5.1 Regex flags
-The most useful flag is probably ignore_case = TRUE because it
allows characters to match either their uppercase or lowercase
forms
-dotall = TRUE lets . match everything, including \n
-multiline = TRUE makes ^ and $ match the start and end of each
line rather than the start and end of the complete string
-comments = TRUE tweaks the pattern language to ignore spaces and new
lines, as well as everything after #. This allows you to use comments
and whitespace to make complex regular expressions more
understandable
16.5.2 Fixed matches
-You can opt-out of the regular expression rules by using fixed()
-fixed() also gives you the ability to ignore case
-If you’re working with non-English text, you will probably want
coll() instead of fixed(), as it implements the full rules for
capitalization as used by the locale you specify.
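A quick sketch of both helpers (the Turkish example mirrors the locale point above):

```r
library(stringr)

str_detect("RStudio", "r")                             # FALSE: no lowercase r
str_detect("RStudio", fixed("r", ignore_case = TRUE))  # TRUE: matches "R"
# coll() applies the locale's capitalization rules, e.g. Turkish dotted İ:
str_detect("\u0130", coll("i", ignore_case = TRUE, locale = "tr"))  # TRUE
```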
16.6 Practice
16.6.2 Boolean operations
-Imagine we want to find words that only contain consonants. One
technique is to create a character class that contains all letters
except for the vowels ([^aeiou]), then allow that to match any number of
letters ([^aeiou]+), then force it to match the whole string by
anchoring to the beginning and the end (^[^aeiou]+$)
-But you can make this problem a bit easier by flipping the
problem around. Instead of looking for words that contain only
consonants, we could look for words that don't contain any
vowels
-If you get stuck trying to create a single regexp that solves your
problem, take a step back and think if you could break the problem down
into smaller pieces, solving each challenge before moving onto the next
one.
16.6.3 Creating a pattern with code
-create the pattern from the vector using str_c() and
str_flatten()
-whenever you create patterns from existing strings it’s wise to run
them through str_escape() to ensure they match literally
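The pattern-building workflow in miniature:

```r
library(stringr)

vowels <- c("a", "e", "i", "o", "u")
pattern <- str_c("[", str_flatten(vowels), "]")
pattern                       # "[aeiou]"
str_count("banana", pattern)  # 3
# str_escape() protects metacharacters when the pieces come from data:
str_view(c("a.c", "abc"), str_escape("."))  # matches only the literal dot in "a.c"
```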
16.6.4 Exercises (Q)
1.For each of the following challenges, try solving it by using both
a single regular expression, and a combination of multiple str_detect()
calls.
a.Find all words that start or end with x.
-Single regex: str_subset(words, "^x|x$")
-Multiple calls: words[str_detect(words, "^x") | str_detect(words, "x$")]
b.Find all words that start with a vowel and end with a consonant.
-Single regex: str_subset(words, "^[aeiou].*[^aeiou]$")
-Multiple calls: words[str_detect(words, "^[aeiou]") &
str_detect(words, "[^aeiou]$")]
c.Are there any words that contain at least one of each different
vowel?
-Multiple calls are easiest: words[str_detect(words, "a") &
str_detect(words, "e") & str_detect(words, "i") &
str_detect(words, "o") & str_detect(words, "u")]
2.Construct patterns to find evidence for and against the rule "i
before e except after c"?
-Evidence for the rule: str_subset(words, "[^c]ie|^ie") and
str_subset(words, "cei")
-Evidence against: str_subset(words, "cie") and
str_subset(words, "[^c]ei|^ei")
3.colors() contains a number of modifiers like “lightgray” and
“darkblue”. How could you automatically identify these modifiers? (Think
about how you might detect and then removed the colors that are
modified).
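A possible starting point for exercise 3 (the modifier list below is an assumption, not exhaustive):

```r
library(stringr)

modifiers <- c("light", "dark", "medium", "pale", "deep")  # assumed list
pattern <- str_c("^(", str_flatten(modifiers, "|"), ")")
modified <- str_subset(colors(), pattern)
head(modified)
# Stripping the modifier should leave a color that also exists on its own:
bases <- str_remove(modified, pattern)
head(bases[bases %in% colors()])
```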
4.Create a regular expression that finds any base R dataset. You can
get a list of these datasets via a special use of the data() function:
data(package = “datasets”)$results[, “Item”]. Note that a number of old
datasets are individual vectors; these contain the name of the grouping
“data frame” in parentheses, so you’ll need to strip those off.
-base_datasets <- data(package = "datasets")$results[, "Item"]
-base_datasets <- str_replace(base_datasets, " \\(.*\\)$", "")
-pattern <- str_c("\\b(", str_flatten(str_escape(base_datasets), "|"), ")\\b")
16.7 Regular expressions in other places
16.7.1 tidyverse
-There are three other particularly useful places where you might
want to use a regular expressions
-matches(pattern) will select all variables whose name matches the
supplied pattern.
-pivot_longer()’s names_pattern argument takes a vector of regular
expressions, just like separate_wider_regex(). It’s useful when
extracting data out of variable names with a complex structure
-The delim argument in separate_longer_delim() and
separate_wider_delim() usually matches a fixed string, but you can use
regex() to make it match a pattern.
16.7.2 Base R
-apropos(pattern) searches all objects available from the global
environment that match the given pattern.
17 Factors
17.2 Factor basics
-Create a list of the valid levels, and then create a factor
following these valid levels.
-If you omit the levels, they’ll be taken from the data in
alphabetical order
-Sorting alphabetically is slightly risky because not every computer
will sort strings in the same way. So forcats::fct() orders by first
appearance
-If you ever need to access the set of valid levels directly, you can
do so with levels()
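These notes can be condensed into one small example:

```r
library(forcats)

x <- c("Mar", "Jan", "Jan", "Apr")
f1 <- factor(x, levels = c("Jan", "Feb", "Mar", "Apr"))
sort(f1)      # sorted by level order, not alphabetically: Jan Jan Mar Apr
f2 <- fct(x)  # levels taken in order of first appearance
levels(f2)    # "Mar" "Jan" "Apr"
```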
17.3 General Social Survey
1.Explore the distribution of rincome (reported income). What makes
the default bar chart hard to understand? How could you improve the
plot?
ggplot(gss_cat, aes(rincome)) +
geom_bar() +
scale_x_discrete(drop = FALSE)
The default bar chart's x-axis labels overlap, making them unreadable.
#Flip the coordinates so the income categories become legible
ggplot(gss_cat, aes(rincome)) +
geom_bar() +
scale_x_discrete(drop = FALSE) +
coord_flip()
2.What is the most common relig in this survey? What’s the most common
partyid?
#Most common relig
gss_cat %>%
count(relig) %>%
arrange(-n) %>%
head(3)
## # A tibble: 3 × 2
## relig n
## <fct> <int>
## 1 Protestant 10846
## 2 Catholic 5124
## 3 None 3523
#Most common partyid
gss_cat %>%
count(partyid) %>%
arrange(-n) %>%
head(3)
## # A tibble: 3 × 2
## partyid n
## <fct> <int>
## 1 Independent 4119
## 2 Not str democrat 3690
## 3 Strong democrat 3490
3.Which relig does denom (denomination) apply to? How can you find
out with a table? How can you find out with a visualization?
#Which relig does denom (denomination) apply to
levels(gss_cat$denom)
## [1] "No answer" "Don't know" "No denomination"
## [4] "Other" "Episcopal" "Presbyterian-dk wh"
## [7] "Presbyterian, merged" "Other presbyterian" "United pres ch in us"
## [10] "Presbyterian c in us" "Lutheran-dk which" "Evangelical luth"
## [13] "Other lutheran" "Wi evan luth synod" "Lutheran-mo synod"
## [16] "Luth ch in america" "Am lutheran" "Methodist-dk which"
## [19] "Other methodist" "United methodist" "Afr meth ep zion"
## [22] "Afr meth episcopal" "Baptist-dk which" "Other baptists"
## [25] "Southern baptist" "Nat bapt conv usa" "Nat bapt conv of am"
## [28] "Am bapt ch in usa" "Am baptist asso" "Not applicable"
#How can you find out with a table
gss_cat %>%
filter(!denom %in% c("No answer", "Other", "Don't know", "Not applicable", "No denomination")) %>%
count(relig)
## # A tibble: 1 × 2
## relig n
## <fct> <int>
## 1 Protestant 7025
#How can you find out with a visualization
gss_cat %>%
count(relig, denom) %>%
ggplot(aes(x = relig, y = denom, size = n)) +
geom_point() +
theme(axis.text.x = element_text(angle = 90))

17.4 Modifying factor order
- There are some suspiciously high numbers in tvhours. Is the mean a
good summary?
gss_cat %>%
filter(!is.na(tvhours)) %>%
ggplot(aes(x = tvhours)) +
geom_histogram(binwidth = 1)
There are some outliers in tvhours, so the median is a better summary
than the mean.
- For each factor in gss_cat identify whether the order of the levels
is arbitrary or principled.
levels(gss_cat$marital)
## [1] "No answer" "Never married" "Separated" "Divorced"
## [5] "Widowed" "Married"
#marital is arbitrary
levels(gss_cat$race)
## [1] "Other" "Black" "White" "Not applicable"
#race is arbitrary
levels(gss_cat$rincome)
## [1] "No answer" "Don't know" "Refused" "$25000 or more"
## [5] "$20000 - 24999" "$15000 - 19999" "$10000 - 14999" "$8000 to 9999"
## [9] "$7000 to 7999" "$6000 to 6999" "$5000 to 5999" "$4000 to 4999"
## [13] "$3000 to 3999" "$1000 to 2999" "Lt $1000" "Not applicable"
#rincome is principled
levels(gss_cat$partyid)
## [1] "No answer" "Don't know" "Other party"
## [4] "Strong republican" "Not str republican" "Ind,near rep"
## [7] "Independent" "Ind,near dem" "Not str democrat"
## [10] "Strong democrat"
#partyid is arbitrary
levels(gss_cat$relig)
## [1] "No answer" "Don't know"
## [3] "Inter-nondenominational" "Native american"
## [5] "Christian" "Orthodox-christian"
## [7] "Moslem/islam" "Other eastern"
## [9] "Hinduism" "Buddhism"
## [11] "Other" "None"
## [13] "Jewish" "Catholic"
## [15] "Protestant" "Not applicable"
#relig is arbitrary
levels(gss_cat$denom)
## [1] "No answer" "Don't know" "No denomination"
## [4] "Other" "Episcopal" "Presbyterian-dk wh"
## [7] "Presbyterian, merged" "Other presbyterian" "United pres ch in us"
## [10] "Presbyterian c in us" "Lutheran-dk which" "Evangelical luth"
## [13] "Other lutheran" "Wi evan luth synod" "Lutheran-mo synod"
## [16] "Luth ch in america" "Am lutheran" "Methodist-dk which"
## [19] "Other methodist" "United methodist" "Afr meth ep zion"
## [22] "Afr meth episcopal" "Baptist-dk which" "Other baptists"
## [25] "Southern baptist" "Nat bapt conv usa" "Nat bapt conv of am"
## [28] "Am bapt ch in usa" "Am baptist asso" "Not applicable"
#denom is arbitrary
- Why did moving “Not applicable” to the front of the levels move it
to the bottom of the plot?
-Because the y-axis is drawn in factor-level order starting from the
bottom: once fct_relevel() makes “Not applicable” the first level, it is
drawn at the bottom of the plot.
17.5 Modifying factor levels
-fct_recode() will leave the levels that aren’t explicitly mentioned
as is, and will warn you if you accidentally refer to a level that
doesn’t exist.
-If you want to collapse a lot of levels, fct_collapse() is a useful
variant of fct_recode()
-Sometimes you just want to lump together the small groups to make a
plot or table simpler. That’s the job of the fct_lump_*() family of
functions. fct_lump_lowfreq() is a simple starting point that
progressively lumps the smallest groups categories into “Other”, always
keeping “Other” as the smallest category.
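A sketch of the lumping idea on this chapter's data (fct_lump_n() keeps the n most frequent levels; the choice of n = 5 here is arbitrary):

```r
library(forcats)
library(dplyr)

gss_cat |>
  mutate(relig = fct_lump_n(relig, n = 5)) |>
  count(relig, sort = TRUE)
```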
- How have the proportions of people identifying as Democrat,
Republican, and Independent changed over time?
gss_cat %>%
mutate(partyid = fct_collapse(partyid,
other = c("No answer", "Don't know", "Other party"),
rep = c("Strong republican", "Not str republican"),
ind = c("Ind,near rep", "Independent", "Ind,near dem"),
dem = c("Not str democrat", "Strong democrat"))) |>
group_by(year, partyid) |>
summarize(n = n()) |>
ggplot(mapping = aes(x = year, y = n, color = fct_reorder2(partyid, year, n))) +
geom_point() +
geom_line() +
labs(color = 'Party',
x = 'Year',
y = 'Count')
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
The three groups follow broadly similar trends over time, but the
Independent share shows the largest changes.
- How could you collapse rincome into a small set of categories?
gss_cat |>
mutate(rincome = fct_collapse(rincome,
"No answer" = c("No answer", "Don't know", "Refused"),
"$0 to 5000" = c("Lt $1000", "$1000 to 3000", "$3001 to 4000", "$4001 to 5000"),
"$5001 to 10000" = c("$5001 to 6000", "$6001 to 7000",
"$7001 to 8000", "$8001 to 10000"))) |>
count(rincome)
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `rincome = fct_collapse(...)`.
## Caused by warning:
## ! Unknown levels in `f`: $1000 to 3000, $3001 to 4000, $4001 to 5000, $5001 to 6000, $6001 to 7000, $7001 to 8000, $8001 to 10000
## # A tibble: 14 × 2
## rincome n
## <fct> <int>
## 1 No answer 1425
## 2 $25000 or more 7363
## 3 $20000 - 24999 1283
## 4 $15000 - 19999 1048
## 5 $10000 - 14999 1168
## 6 $8000 to 9999 340
## 7 $7000 to 7999 188
## 8 $6000 to 6999 215
## 9 $5000 to 5999 227
## 10 $4000 to 4999 226
## 11 $3000 to 3999 276
## 12 $1000 to 2999 395
## 13 $0 to 5000 286
## 14 Not applicable 7043
- Notice there are 9 groups (excluding other) in the fct_lump example
above. Why not 10? (Hint: type ?fct_lump, and find the default for the
argument other_level is “Other”.)
Because relig already contains a level named “Other”. fct_lump() lumps
the rarest levels into the default other_level = “Other”, and the
pre-existing “Other” level is merged into that lumped category, so only
9 groups remain outside “Other” instead of 10.
17.6 Ordered factors
-Ordered factors, created with ordered(), imply a strict ordering and
equal distance between levels: the first level is “less than” the second
level by the same amount that the second level is “less than” the third
level, and so on.
18 Dates and times
library(tidyverse)
library(nycflights13)
18.2 Creating date/times
18.2.1 During import
-If your CSV contains an ISO8601 date or date-time, you don’t need to
do anything; readr will automatically recognize it
-For other date-time formats, you’ll need to use col_types plus
col_date() or col_datetime() along with a date-time format.
-If you’re using %b or %B and working with non-English dates, you’ll
also need to provide a locale(). See the list of built-in languages in
date_names_langs(), or create your own with date_names(),
18.2.2 From strings
Identify the order in which year, month, and day appear in your
dates, then arrange “y”, “m”, and “d” in the same order. That gives you
the name of the lubridate function that will parse your date.
ymd() and friends create dates.
To create a date/time from this sort of input, use make_date()
for dates, or make_datetime() for date-times
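The helpers above in one place:

```r
library(lubridate)

ymd("2017-01-31")          # "2017-01-31"
mdy("January 31st, 2017")  # "2017-01-31"
dmy("31-Jan-2017")         # "2017-01-31"
make_date(2017, 1, 31)     # "2017-01-31"
make_datetime(2017, 1, 31, 8, 30)  # "2017-01-31 08:30:00 UTC"
```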
18.2.4 From other types
-You may want to switch between a date-time and a date. That’s the
job of as_datetime() and as_date():
-Sometimes you’ll get date/times as numeric offsets from the “Unix
Epoch”, 1970-01-01. If the offset is in seconds, use as_datetime(); if
it’s in days, use as_date().
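For example:

```r
library(lubridate)

as_date(ymd_hms("2013-07-01 12:34:56"))  # drops the time: "2013-07-01"
as_datetime(60 * 60 * 10)  # 10 hours after the Unix epoch: "1970-01-01 10:00:00 UTC"
as_date(365 * 10 + 2)      # 3652 days after the epoch: "1980-01-01"
```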
18.2.5 Exercises
- What happens if you parse a string that contains invalid dates?
ymd(c("2010-10-10", "bananas"))
## Warning: 1 failed to parse.
## [1] "2010-10-10" NA
-It warns that the invalid entry failed to parse and returns NA for it.
- What does the tzone argument to today() do? Why is it
important?
-It is a character vector specifying which time zone you would like
the current time in. It is important since different time-zones can have
different dates, and tzone can help us specify the time.
- For each of the following date-times, show how you’d parse it using
a readr column specification and a lubridate function.
d1 <- "January 1, 2010"
d2 <- "2015-Mar-07"
d3 <- "06-Jun-2017"
d4 <- c("August 19 (2015)", "July 1 (2015)")
d5 <- "12/30/14" # Dec 30, 2014
t1 <- "1705"
t2 <- "11:15:10.12 PM"
library(lubridate)
d1 <- "January 1, 2010"
parse_date(d1, format = "%B %d, %Y")
## [1] "2010-01-01"
d2 <- "2015-Mar-07"
parse_date(d2, format = "%Y-%b-%d")
## [1] "2015-03-07"
d3 <- "06-Jun-2017"
parse_date(d3, format = "%d-%b-%Y")
## [1] "2017-06-06"
d4 <- c("August 19 (2015)", "July 1 (2015)")
parse_date(d4, format = "%B %d (%Y)")
## [1] "2015-08-19" "2015-07-01"
d5 <- "12/30/14"
parsed_date5 <- parse_date(d5, format = "%m/%d/%y")
t1 <- "1705"
parse_time(t1, format = "%H%M")
## 17:05:00
t2 <- "11:15:10.12 PM"
parse_time(t2, format = "%I:%M:%OS %p")
## 23:15:10.12
18.3 Date-time components
18.3.1 Getting components
-You can pull out individual parts of the date with the accessor
functions year(), month(), mday() (day of the month), yday() (day of the
year), wday() (day of the week), hour(), minute(), and second(). These
are effectively the opposites of make_datetime().
-For month() and wday() you can set label = TRUE to return the
abbreviated name of the month or day of the week. Set abbr = FALSE to
return the full name.
-We can use wday() to see that more flights depart during the week
than on the weekend
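The accessors in brief:

```r
library(lubridate)

datetime <- ymd_hms("2026-07-08 12:34:56")
year(datetime)                 # 2026
month(datetime, label = TRUE)  # Jul (an ordered factor)
mday(datetime)                 # 8
hour(datetime)                 # 12
```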
18.3.2 Rounding
-An alternative approach to plotting individual components is to
round the date to a nearby unit of time, with floor_date(),
round_date(), and ceiling_date(). Each function takes a vector of dates
to adjust and then the name of the unit to round down (floor), round up
(ceiling), or round to.
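For example:

```r
library(lubridate)

x <- ymd_hms("2013-01-01 12:34:56")
floor_date(x, "hour")    # "2013-01-01 12:00:00 UTC"
round_date(x, "hour")    # "2013-01-01 13:00:00 UTC" (34:56 rounds up)
ceiling_date(x, "day")   # "2013-01-02 UTC"
```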
18.3.3 Modifying components
-Alternatively, rather than modifying an existing variable, you can
create a new date-time with update()
18.3.4 Exercises (Q)
1.How does the distribution of flight times within a day change over
the course of the year?
#Preparation
flights |>
select(year, month, day, hour, minute) |>
mutate(departure = make_datetime(year, month, day, hour, minute))
## # A tibble: 336,776 × 6
## year month day hour minute departure
## <int> <int> <int> <dbl> <dbl> <dttm>
## 1 2013 1 1 51 15 2013-01-03 03:15:00
## 2 2013 1 1 52 29 2013-01-03 04:29:00
## 3 2013 1 1 54 40 2013-01-03 06:40:00
## 4 2013 1 1 54 45 2013-01-03 06:45:00
## 5 2013 1 1 60 0 2013-01-03 12:00:00
## 6 2013 1 1 55 58 2013-01-03 07:58:00
## 7 2013 1 1 60 0 2013-01-03 12:00:00
## 8 2013 1 1 60 0 2013-01-03 12:00:00
## 9 2013 1 1 60 0 2013-01-03 12:00:00
## 10 2013 1 1 60 0 2013-01-03 12:00:00
## # ℹ 336,766 more rows
make_datetime_100 <- function(year, month, day, time) {
make_datetime(year, month, day, time %/% 100, time %% 100)
}
flights_dt <- flights |>
filter(!is.na(dep_time), !is.na(arr_time)) |>
mutate(
dep_time = as.integer(dep_time),
arr_time = as.integer(arr_time),
dep_time = make_datetime_100(year, month, day, dep_time),
arr_time = make_datetime_100(year, month, day, arr_time),
sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
) |>
select(origin, dest, ends_with("delay"), ends_with("time"))
flights_dt
## # A tibble: 328,063 × 9
## origin dest dep_delay arr_delay dep_time sched_dep_time
## <chr> <chr> <dbl> <dbl> <dttm> <dttm>
## 1 EWR IAH 2 11 2013-01-01 00:05:00 2013-01-01 05:15:00
## 2 LGA IAH 4 20 2013-01-01 00:05:00 2013-01-01 05:29:00
## 3 JFK MIA 2 33 2013-01-01 00:05:00 2013-01-01 05:40:00
## 4 JFK BQN -1 -18 2013-01-01 00:05:00 2013-01-01 05:45:00
## 5 LGA ATL -6 -25 2013-01-01 00:05:00 2013-01-01 06:00:00
## 6 EWR ORD -4 12 2013-01-01 00:05:00 2013-01-01 05:58:00
## 7 EWR FLL -5 19 2013-01-01 00:05:00 2013-01-01 06:00:00
## 8 LGA IAD -3 -14 2013-01-01 00:05:00 2013-01-01 06:00:00
## 9 JFK MCO -3 -8 2013-01-01 00:05:00 2013-01-01 06:00:00
## 10 LGA ORD -2 8 2013-01-01 00:06:00 2013-01-01 06:00:00
## # ℹ 328,053 more rows
## # ℹ 3 more variables: arr_time <dttm>, sched_arr_time <dttm>, air_time <dbl>
#Plot
flights_dt |>
  filter(!is.na(dep_time)) |>
  mutate(dep_hour = update(dep_time, yday = 1)) |>
  mutate(month = factor(month(dep_time))) |>
  ggplot(aes(x = dep_hour, group = month, color = month)) +
  geom_freqpoly(binwidth = 60 * 30) #binwidth is in seconds; 30-minute bins

2.Compare dep_time, sched_dep_time and dep_delay. Are they
consistent? Explain your findings.
flights_dt |>
select(contains('dep')) |>
mutate(cal_delay = as.numeric(dep_time - sched_dep_time) / 60) |>
filter(dep_delay != cal_delay)
## # A tibble: 328,063 × 4
## dep_delay dep_time sched_dep_time cal_delay
## <dbl> <dttm> <dttm> <dbl>
## 1 2 2013-01-01 00:05:00 2013-01-01 05:15:00 -0.0861
## 2 4 2013-01-01 00:05:00 2013-01-01 05:29:00 -0.09
## 3 2 2013-01-01 00:05:00 2013-01-01 05:40:00 -0.0931
## 4 -1 2013-01-01 00:05:00 2013-01-01 05:45:00 -0.0944
## 5 -6 2013-01-01 00:05:00 2013-01-01 06:00:00 -0.0986
## 6 -4 2013-01-01 00:05:00 2013-01-01 05:58:00 -0.0981
## 7 -5 2013-01-01 00:05:00 2013-01-01 06:00:00 -0.0986
## 8 -3 2013-01-01 00:05:00 2013-01-01 06:00:00 -0.0986
## 9 -3 2013-01-01 00:05:00 2013-01-01 06:00:00 -0.0986
## 10 -2 2013-01-01 00:06:00 2013-01-01 06:00:00 -0.0983
## # ℹ 328,053 more rows
-They are not perfectly consistent. Both dep_time and sched_dep_time
are built from the flight's scheduled date, so flights that actually
departed after midnight get a dep_time on the wrong day, and their
computed difference disagrees with dep_delay. (Note also that
subtracting two date-times yields a difftime whose units should be set
explicitly, e.g. as.numeric(dep_time - sched_dep_time, units = "mins"),
before comparing with dep_delay.)
3.Compare air_time with the duration between the departure and
arrival. Explain your findings. (Hint: consider the location of the
airport.) (Why duration is ZERO?)
flights_dt |>
mutate(
flight_duration = as.numeric(arr_time - dep_time),
air_time_mins = air_time,
diff = flight_duration - air_time_mins
) |>
select(origin, dest, flight_duration, air_time_mins, diff)
## # A tibble: 328,063 × 5
## origin dest flight_duration air_time_mins diff
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 EWR IAH 180 227 -47
## 2 LGA IAH 180 227 -47
## 3 JFK MIA 240 160 80
## 4 JFK BQN 300 183 117
## 5 LGA ATL 180 116 64
## 6 EWR ORD 120 150 -30
## 7 EWR FLL 240 158 82
## 8 LGA IAD 120 53 67
## 9 JFK MCO 180 140 40
## 10 LGA ORD 60 138 -78
## # ℹ 328,053 more rows
4.How does the average delay time change over the course of a day?
Should you use dep_time or sched_dep_time? Why?
#dep_time
flights_dt |>
  mutate(dep_hour = hour(dep_time)) |>
  group_by(dep_hour) |>
  summarise(dep_delay = mean(dep_delay)) |>
  ggplot(aes(y = dep_delay, x = dep_hour)) +
  geom_point() +
  geom_smooth()

#sched_dep_hour
flights_dt |>
mutate(sched_dep_hour = hour(sched_dep_time)) |>
group_by(sched_dep_hour) |>
summarise(dep_delay = mean(dep_delay)) |>
ggplot(aes(y = dep_delay, x = sched_dep_hour)) +
geom_point() +
geom_smooth()
-We should use sched_dep_time: flights delayed past midnight are
recorded by dep_time as departing early the next morning, which biases
the apparent delays, while sched_dep_time reflects the planned schedule.
5.On what day of the week should you leave if you want to minimise
the chance of a delay?
flights_dt |>
mutate(weekday = wday(sched_dep_time, label = TRUE)) |>
group_by(weekday) |>
summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE),
avg_arr_delay = mean(arr_delay, na.rm = TRUE)) |>
gather(key = 'delay', value = 'minutes', 2:3) |>
ggplot() +
geom_col(mapping = aes(x = weekday, y = minutes, fill = delay),
position = 'dodge')
-Looks like Saturday is the best day for a flight.
6.What makes the distribution of diamonds$carat and
flights$sched_dep_time similar?
#The distribution of diamonds
diamonds |>
ggplot() +
geom_freqpoly(mapping = aes(x = carat), binwidth = .04)

#The distribution of flights
flights_dt |>
mutate(minutes = minute(sched_dep_time)) |>
ggplot() +
geom_freqpoly(mapping = aes(x = minutes), binwidth = 1)
-Both distributions spike at "nice" human-chosen values: carat clusters
at round numbers, and scheduled departure times cluster at minutes like
00 and 30.
7.Confirm our hypothesis that the early departures of flights in
minutes 20-30 and 50-60 are caused by scheduled flights that leave
early. Hint: create a binary variable that tells you whether or not a
flight was delayed.
flights_dt |>
mutate(delayed = dep_delay > 0,
minutes = minute(sched_dep_time) %/% 10 * 10,
minutes = factor(minutes, levels = c(0,10,20,30,40,50),
labels = c('0 - 9 mins',
'10 - 19 mins',
'20 - 29 mins',
'30 - 39 mins',
'40 - 49 mins',
'50 - 59 mins'))) |>
group_by(minutes) |>
summarize(prop_early = 1 - mean(delayed, na.rm = TRUE)) |>
ggplot() +
geom_point(mapping = aes(x = minutes, y = prop_early)) +
labs(x = 'Scheduled departure (minutes)',
y = 'Proportion of early departures')

18.4 Time spans
18.4.1 Durations
-A difftime class object records a time span of seconds, minutes,
hours, days, or weeks; lubridate's durations are an alternative that
always uses seconds.
18.4.2 Periods
-Periods are time spans but don’t have a fixed length in seconds,
instead they work with “human” times, like days and months.
18.4.3 Intervals
-We can create an interval by writing start %--% end
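Intervals make period arithmetic unambiguous because they have a concrete start point:

```r
library(lubridate)

y2023 <- ymd("2023-01-01") %--% ymd("2024-01-01")
y2024 <- ymd("2024-01-01") %--% ymd("2025-01-01")
y2023 / days(1)  # 365
y2024 / days(1)  # 366 (leap year)
```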
18.4.4 Exercises (Q, 4)
1.Explain days(!overnight) and days(overnight) to someone who has
just started learning R. What is the key fact you need to know?
-overnight is a logical (boolean) variable, and the key fact is that
TRUE coerces to 1 and FALSE to 0. So days(overnight) is days(1) for
flights that landed after midnight, adding one day to arr_time and
sched_arr_time, and days(0) otherwise; days(!overnight) is the reverse.
2.Create a vector of dates giving the first day of every month in
2015. Create a vector of dates giving the first day of every month in
the current year.
first_days_2015 <- ymd("2015-01-01") + months(0:11)
first_days_2015
## [1] "2015-01-01" "2015-02-01" "2015-03-01" "2015-04-01" "2015-05-01"
## [6] "2015-06-01" "2015-07-01" "2015-08-01" "2015-09-01" "2015-10-01"
## [11] "2015-11-01" "2015-12-01"
first_days_current <- floor_date(today(), unit = "year") + months(0:11)
-Note that years(2015) + months(1:12) + days(1) only builds period
objects, not dates; start from an actual date and add months(0:11).
3.Write a function that given your birthday (as a date), returns how
old you are in years.
howold <- function(d) {
age <- today() - d
return(floor(age/dyears(1)))
}
howold(ymd(19980419))
## [1] 25
4.Why can't (today() %--% (today() + years(1))) / months(1) work?
(today() %--% (today() + years(1))) / months(1)
## [1] 12
-Written as an interval, as above, it does work and returns 12. What
doesn't work reliably is dividing bare periods such as years(1) /
months(1): periods have no fixed length, so only an estimate is
possible.
18.5 Time zones
-Use Sys.timezone() to find the current time zone.
-OlsonNames() lists all time zone names.
-There are two ways to change a time zone:
1. Keep the instant in time the same and change how it's displayed
(with_tz()). Use this when the instant is correct, but you want a
more natural display.
2. Change the underlying instant in time (force_tz()). Use this when
you have an instant that has been labelled with the incorrect time
zone, and you need to fix it.
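The two options can be sketched with lubridate's with_tz() and force_tz() (instant and zones chosen arbitrarily):

```r
library(lubridate)

x <- ymd_hms("2024-06-01 12:00:00", tz = "America/New_York")

# 1. Same instant, different display: 17:00 London time,
#    still the same moment
with_tz(x, "Europe/London")

# 2. Same clock time, different instant: 12:00 London time,
#    five hours earlier than x
force_tz(x, "Europe/London")
```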
19 Missing values
19.2 Explicit missing values
19.2.1 Last observation carried forward
-When data is entered by hand, missing values sometimes indicate that
the value in the previous row has been repeated (or carried forward).
-We can fill in these missing values with tidyr::fill(). It works
like select(), taking a set of columns.
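A minimal sketch of fill() carrying values forward (toy data made up for illustration):

```r
library(tidyverse)

treatment <- tribble(
  ~person,            ~response,
  "Derrick Whitmore", 7,
  NA,                 10,
  NA,                 NA,
  "Katherine Burke",  4
)

# Replace each NA in person with the most recent non-missing value
treatment |> fill(person)
```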
19.2.2 Fixed values
-Sometimes missing values represent a fixed and known value, most
commonly 0. You can use dplyr::coalesce() to replace them.
-The opposite problem is a concrete value (e.g., 99) that really
represents a missing value. If possible, handle this when reading in
the data, for example with the na argument to readr::read_csv(),
e.g., read_csv(path, na = "99"). If you discover the problem later,
or your data source doesn't give you a way to handle it on read, you
can use dplyr::na_if().
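A sketch of both directions (a sentinel value of -99 assumed for illustration):

```r
library(dplyr)

x <- c(1, 4, 5, 7, NA)
coalesce(x, 0)    # NA -> 0: 1 4 5 7 0

y <- c(1, 4, 5, 7, -99)
na_if(y, -99)     # sentinel -> NA: 1 4 5 7 NA
```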
19.2.3 NaN
-NaN (pronounced "nan", short for "not a number") generally behaves
just like NA. In the rare case you need to distinguish an NA from a
NaN, use is.nan(x).
19.3 Implicit missing values
-An explicit missing value is the presence of an absence.
-An implicit missing value is the absence of a presence.
19.3.1 Pivoting
-Making data wider can make implicit missing values explicit because
every combination of the rows and new columns must have some value.
-By default, making data longer preserves explicit missing values,
but if they are structurally missing values that only exist because the
data is not tidy, you can drop them (make them implicit) by setting
values_drop_na = TRUE.
19.3.2 Complete
-tidyr::complete() allows you to generate explicit missing values by
providing a set of variables that define the combination of rows that
should exist.
-Usually call complete() with names of existing variables, filling in
the missing combinations. However, sometimes the individual variables
are themselves incomplete, so you can instead provide your own data.
-If the range of a variable is correct, but not all values are
present, you could use full_seq(x, 1) to generate all values from min(x)
to max(x) spaced out by 1.
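A sketch combining complete() with full_seq() (toy data; the gap at 2022 becomes an explicit NA row):

```r
library(tidyverse)

df <- tibble(year = c(2020, 2021, 2023), value = c(10, 12, 15))

# full_seq() generates 2020:2023; the new 2022 row gets value = NA
df |> complete(year = full_seq(year, 1))
```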
19.3.3 Joins
-dplyr::anti_join(x, y) is a particularly useful tool here because it
selects only the rows in x that don’t have a match in y.
19.3.4 Exercises
Can you find any relationship between the carrier and the rows that
appear to be missing from planes?
missing_planes <- anti_join(flights, planes, by = "tailnum")
missing_planes |>
group_by(carrier) |>
summarize(missing_planes = n())
## # A tibble: 10 × 2
## carrier missing_planes
## <chr> <int>
## 1 9E 1044
## 2 AA 22558
## 3 B6 830
## 4 DL 110
## 5 F9 50
## 6 FL 187
## 7 MQ 25397
## 8 UA 1693
## 9 US 699
## 10 WN 38
AA and MQ account for the vast majority of flights whose tailnum has
no match in planes.
19.4 Factors and empty groups
-A final type of missingness is the empty group, a group that doesn’t
contain any observations, which can arise when working with factors.
-We can use .drop = FALSE to preserve all factor levels.
-All summary functions work with zero-length vectors, but they may
return results that are surprising at first glance.
-Sometimes a simpler approach is to perform the summary and then make
the implicit missings explicit with complete().
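A sketch of an empty factor level surviving a count (toy data; without .drop = FALSE the "yes" row would vanish):

```r
library(tidyverse)

health <- tibble(
  name   = c("Ikaia", "Oletta", "Leriah"),
  smoker = factor(c("no", "no", "no"), levels = c("yes", "no"))
)

# .drop = FALSE keeps the empty "yes" group, with n = 0
health |> count(smoker, .drop = FALSE)
```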
19 Joins
19.2 Keys
19.2.1 Primary and foreign keys
-A primary key is a variable or set of variables that uniquely
identifies each observation. When more than one variable is needed, the
key is called a compound key.
-A foreign key is a variable (or set of variables) that corresponds
to a primary key in another table.
19.2.2 Checking primary keys
-One way to do that is to count() the primary keys and look for
entries where n is greater than one.
-You should also check for missing values in your primary keys — if a
value is missing then it can’t identify an observation!
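Both checks can be sketched against planes from nycflights13; each should return zero rows if tailnum is a valid primary key:

```r
library(nycflights13)
library(dplyr)

# Any key that appears more than once?
planes |>
  count(tailnum) |>
  filter(n > 1)

# Any missing keys?
planes |>
  filter(is.na(tailnum))
```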
19.2.3 Surrogate keys
-A surrogate key is an artificial key (such as a row number) that you
create when a table has no simple primary key; surrogate keys can be
particularly useful when communicating with other humans.
19.2.4 Exercises
1. We forgot to draw the relationship between weather and airports in
Figure 19.1. What is the relationship and how should it appear in the
diagram?
library(nycflights13)
summary(weather)
## origin year month day
## Length:26115 Min. :2013 Min. : 1.000 Min. : 1.00
## Class :character 1st Qu.:2013 1st Qu.: 4.000 1st Qu.: 8.00
## Mode :character Median :2013 Median : 7.000 Median :16.00
## Mean :2013 Mean : 6.504 Mean :15.68
## 3rd Qu.:2013 3rd Qu.: 9.000 3rd Qu.:23.00
## Max. :2013 Max. :12.000 Max. :31.00
##
## hour temp dewp humid
## Min. : 0.00 Min. : 10.94 Min. :-9.94 Min. : 12.74
## 1st Qu.: 6.00 1st Qu.: 39.92 1st Qu.:26.06 1st Qu.: 47.05
## Median :11.00 Median : 55.40 Median :42.08 Median : 61.79
## Mean :11.49 Mean : 55.26 Mean :41.44 Mean : 62.53
## 3rd Qu.:17.00 3rd Qu.: 69.98 3rd Qu.:57.92 3rd Qu.: 78.79
## Max. :23.00 Max. :100.04 Max. :78.08 Max. :100.00
## NA's :1 NA's :1 NA's :1
## wind_dir wind_speed wind_gust precip
## Min. : 0.0 Min. : 0.000 Min. :16.11 Min. :0.000000
## 1st Qu.:120.0 1st Qu.: 6.905 1st Qu.:20.71 1st Qu.:0.000000
## Median :220.0 Median : 10.357 Median :24.17 Median :0.000000
## Mean :199.8 Mean : 10.518 Mean :25.49 Mean :0.004469
## 3rd Qu.:290.0 3rd Qu.: 13.809 3rd Qu.:28.77 3rd Qu.:0.000000
## Max. :360.0 Max. :1048.361 Max. :66.75 Max. :1.210000
## NA's :460 NA's :4 NA's :20778
## pressure visib time_hour
## Min. : 983.8 Min. : 0.000 Min. :2013-01-01 01:00:00.0
## 1st Qu.:1012.9 1st Qu.:10.000 1st Qu.:2013-04-01 21:30:00.0
## Median :1017.6 Median :10.000 Median :2013-07-01 14:00:00.0
## Mean :1017.9 Mean : 9.255 Mean :2013-07-01 18:26:37.7
## 3rd Qu.:1023.0 3rd Qu.:10.000 3rd Qu.:2013-09-30 13:00:00.0
## Max. :1042.1 Max. :10.000 Max. :2013-12-30 18:00:00.0
## NA's :2729
summary(airports)
## faa name lat lon
## Length:1458 Length:1458 Min. :19.72 Min. :-176.65
## Class :character Class :character 1st Qu.:34.26 1st Qu.:-119.19
## Mode :character Mode :character Median :40.09 Median : -94.66
## Mean :41.65 Mean :-103.39
## 3rd Qu.:45.07 3rd Qu.: -82.52
## Max. :72.27 Max. : 174.11
## alt tz dst tzone
## Min. : -54.00 Min. :-10.000 Length:1458 Length:1458
## 1st Qu.: 70.25 1st Qu.: -8.000 Class :character Class :character
## Median : 473.00 Median : -6.000 Mode :character Mode :character
## Mean :1001.42 Mean : -6.519
## 3rd Qu.:1062.50 3rd Qu.: -5.000
## Max. :9078.00 Max. : 8.000
-weather$origin is a foreign key that matches the primary key
airports$faa, so the diagram should connect weather$origin to
airports$faa.
2. weather only contains information for the three origin airports
in NYC. If it contained weather records for all airports in the USA,
what additional connection would it make to flights?
3. The year, month, day, hour, and origin variables almost form a
compound key for weather, but there's one hour that has duplicate
observations. Can you figure out what's special about that hour?
-The duplicated hour is the repeated 1am on November 3, 2013, when
daylight saving time ended and clocks fell back.
4. We know that some days of the year are special and fewer people
than usual fly on them (e.g., Christmas Eve and Christmas Day). How
might you represent that data as a data frame? What would be the
primary key? How would it connect to the existing data frames?
5. Draw a diagram illustrating the connections between the Batting,
People, and Salaries data frames in the Lahman package. Draw another
diagram that shows the relationship between People, Managers, and
AwardsManagers. How would you characterize the relationship between
the Batting, Pitching, and Fielding data frames?